Dataset Schema Reference

All Stravoris medical MCQ datasets use this exact schema. Every subject, row, and file has the same columns, the same types, and the same semantics.

Columns

Column             Type          Nullable  Description
id                 string        No        Unique identifier
subject            string        No        Subject name
source             string        No        "medmcqa" or "synthetic_YYYY-MM-DD"
raw_topic          string        Yes       Original topic from source dataset
syllabus_topic     string        No        Normalized curriculum topic
question           string        No        Question stem and lead-in
options            list[string]  No        Exactly 4 answer options
correct_index      integer       No        Index of correct option (0-3)
explanation        string        Yes       Explanation of correct answer
blooms_level       string        No        recall / comprehension / application / analysis
model_version      string        Yes       Model that generated synthetic questions
generation_date    string        Yes       ISO date for synthetic questions
curation_decision  string        No        "accept" or "fix"
curation_flaws     list[string]  No        Item-writing flaw tags from curation
curator_notes      string        Yes       Free-text curation notes
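
The constraints in the table can be checked row by row. The following is a minimal sketch, not part of the Stravoris tooling: `validate_row`, `REQUIRED`, and `example_row` are hypothetical names introduced here, and the example row's values are made up purely for illustration.

```python
# Hypothetical helper: checks one row (a dict) against the schema table above.
REQUIRED = (
    "id", "subject", "source", "syllabus_topic", "question", "options",
    "correct_index", "blooms_level", "curation_decision", "curation_flaws",
)
BLOOMS_LEVELS = {"recall", "comprehension", "application", "analysis"}
CURATION_DECISIONS = {"accept", "fix"}


def validate_row(row: dict) -> list[str]:
    """Return a list of schema problems for one row (empty if none found)."""
    problems = []
    # Non-nullable columns must be present and not None.
    for name in REQUIRED:
        if row.get(name) is None:
            problems.append(f"missing required column: {name}")
    options = row.get("options")
    if options is not None and len(options) != 4:
        problems.append("options must contain exactly 4 entries")
    index = row.get("correct_index")
    if index is not None and index not in range(4):
        problems.append("correct_index must be between 0 and 3")
    level = row.get("blooms_level")
    if level is not None and level not in BLOOMS_LEVELS:
        problems.append(f"unknown blooms_level: {level!r}")
    decision = row.get("curation_decision")
    if decision is not None and decision not in CURATION_DECISIONS:
        problems.append(f"unknown curation_decision: {decision!r}")
    return problems


# Made-up row for illustration only; not taken from any dataset file.
example_row = {
    "id": "example-0001",
    "subject": "anatomy",
    "source": "medmcqa",
    "raw_topic": None,
    "syllabus_topic": "example topic",
    "question": "Example question stem?",
    "options": ["option A", "option B", "option C", "option D"],
    "correct_index": 2,
    "explanation": None,
    "blooms_level": "recall",
    "model_version": None,
    "generation_date": None,
    "curation_decision": "accept",
    "curation_flaws": [],
    "curator_notes": None,
}
```

Calling `validate_row(example_row)` returns an empty list. Nullable columns are deliberately not checked, since the table allows them to be missing.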

Loading Examples

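pandas

    import pandas as pd

    df = pd.read_parquet("anatomy/anatomy.parquet")

HF datasets

    from datasets import load_dataset

    # With a single data file and no split argument, the rows land in the "train" split.
    ds = load_dataset("parquet", data_files="anatomy/anatomy.parquet")

JSONL

    import json

    # Each line is one JSON object; read the first question and stop.
    with open("anatomy/anatomy.jsonl", encoding="utf-8") as f:
        for line in f:
            q = json.loads(line)
            break

All subjects at once

    import pandas as pd

    subjects = ["anatomy", "biochemistry", "cell-biology", "physiology", "pathology", "microbiology"]
    # Every file shares the same schema, so the frames concatenate without column alignment issues.
    full = pd.concat([pd.read_parquet(f"{s}/{s}.parquet") for s in subjects], ignore_index=True)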
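
Once rows are loaded, the two most common lookups are probably the text of the correct answer and the distribution of Bloom's levels. The sketch below works on plain dicts, such as those produced by the JSONL reader; `correct_option`, `count_by_blooms_level`, and the `rows` sample are hypothetical and exist only for illustration.

```python
from collections import Counter


def correct_option(row: dict) -> str:
    """Return the text of the correct answer: options[correct_index]."""
    return row["options"][row["correct_index"]]


def count_by_blooms_level(rows) -> Counter:
    """Tally rows by their blooms_level value."""
    return Counter(row["blooms_level"] for row in rows)


# Made-up rows for illustration; only the columns used here are filled in.
rows = [
    {"options": ["option A", "option B", "option C", "option D"],
     "correct_index": 2, "blooms_level": "recall"},
    {"options": ["option A", "option B", "option C", "option D"],
     "correct_index": 0, "blooms_level": "application"},
    {"options": ["option A", "option B", "option C", "option D"],
     "correct_index": 3, "blooms_level": "recall"},
]

answers = [correct_option(row) for row in rows]
level_counts = count_by_blooms_level(rows)
```

Because `correct_index` is zero-based, it can be used directly as a list index with no offset.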

Key Properties

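  • 4 options per question, always (options, correct_index)
  • Curated against 24 published item-writing flaws (curation_decision, curation_flaws)
  • Bloom's taxonomy tagged (blooms_level)
  • Syllabus-aligned (syllabus_topic)
  • Provenance tracked (source, model_version, generation_date)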
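
Provenance can be read back out of the `source` column, which is either "medmcqa" or "synthetic_YYYY-MM-DD". A minimal sketch, assuming the date suffix is a valid ISO calendar date; `parse_source` is a hypothetical helper and the date used in the usage lines is made up.

```python
from datetime import date


def parse_source(source: str):
    """Split a source value into (origin, date), where date is None for medmcqa."""
    if source == "medmcqa":
        return ("medmcqa", None)
    prefix = "synthetic_"
    if source.startswith(prefix):
        # date.fromisoformat raises ValueError if the suffix is not YYYY-MM-DD.
        return ("synthetic", date.fromisoformat(source[len(prefix):]))
    raise ValueError(f"unexpected source value: {source!r}")


# Usage with an illustrative (made-up) synthetic date.
origin, generated_on = parse_source("synthetic_2024-01-15")
```

Raising on unrecognized values, instead of returning a default, keeps rows with an unexpected `source` from passing silently through downstream filters.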