Dataset Schema Reference
All Stravoris medical MCQ datasets use this exact schema. Every subject, every row, every file — same columns, same types, same semantics.
Columns
| Column | Type | Nullable | Description |
|---|---|---|---|
| id | string | No | Unique identifier |
| subject | string | No | Subject name |
| source | string | No | "medmcqa" or "synthetic_YYYY-MM-DD" |
| raw_topic | string | Yes | Original topic from source dataset |
| syllabus_topic | string | No | Normalized curriculum topic |
| question | string | No | Question stem and lead-in |
| options | list[string] | No | Exactly 4 answer options |
| correct_index | integer | No | Index of correct option (0-3) |
| explanation | string | Yes | Explanation of correct answer |
| blooms_level | string | No | recall / comprehension / application / analysis |
| model_version | string | Yes | Model that generated synthetic questions |
| generation_date | string | Yes | ISO date for synthetic questions |
| curation_decision | string | No | "accept" or "fix" |
| curation_flaws | list[string] | No | Item-writing flaw tags from curation |
| curator_notes | string | Yes | Free-text curation notes |
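
The constraints in the table can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: `validate_row`, the constant names, and the error messages are hypothetical, and the sample row is made up for illustration. It assumes rows decoded from the JSONL files into plain Python dicts (rows loaded through pandas carry NumPy types and would need converting first).

```python
import re

# Non-nullable columns, per the schema table.
REQUIRED = (
    "id", "subject", "source", "syllabus_topic", "question", "options",
    "correct_index", "blooms_level", "curation_decision", "curation_flaws",
)
BLOOMS_LEVELS = {"recall", "comprehension", "application", "analysis"}
CURATION_DECISIONS = {"accept", "fix"}
# "medmcqa" or "synthetic_YYYY-MM-DD" (pattern check only, not a calendar check).
SOURCE_PATTERN = re.compile(r"^(medmcqa|synthetic_\d{4}-\d{2}-\d{2})$")


def validate_row(row):
    """Return a list of schema problems; an empty list means the row passed."""
    problems = []
    for name in REQUIRED:
        if row.get(name) is None:
            problems.append(f"{name}: missing or null")

    source = row.get("source")
    if isinstance(source, str) and not SOURCE_PATTERN.match(source):
        problems.append("source: expected 'medmcqa' or 'synthetic_YYYY-MM-DD'")

    options = row.get("options")
    if options is not None and (
        not isinstance(options, list)
        or len(options) != 4
        or not all(isinstance(o, str) for o in options)
    ):
        problems.append("options: expected a list of exactly 4 strings")

    idx = row.get("correct_index")
    if idx is not None and (
        isinstance(idx, bool) or not isinstance(idx, int) or not 0 <= idx <= 3
    ):
        problems.append("correct_index: expected an integer from 0 to 3")

    level = row.get("blooms_level")
    if level is not None and level not in BLOOMS_LEVELS:
        problems.append("blooms_level: unknown level")

    decision = row.get("curation_decision")
    if decision is not None and decision not in CURATION_DECISIONS:
        problems.append("curation_decision: expected 'accept' or 'fix'")

    return problems


# Made-up row for illustration only; not taken from the dataset.
sample_row = {
    "id": "example-0001",
    "subject": "anatomy",
    "source": "medmcqa",
    "raw_topic": None,
    "syllabus_topic": "Lower limb",
    "question": "Which is the longest bone in the human body?",
    "options": ["Humerus", "Femur", "Tibia", "Fibula"],
    "correct_index": 1,
    "explanation": None,
    "blooms_level": "recall",
    "model_version": None,
    "generation_date": None,
    "curation_decision": "accept",
    "curation_flaws": [],
    "curator_notes": None,
}
print(validate_row(sample_row))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a row at once.
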
Loading Examples

pandas

```python
import pandas as pd

df = pd.read_parquet("anatomy/anatomy.parquet")
```

HF datasets

```python
from datasets import load_dataset

ds = load_dataset("parquet", data_files="anatomy/anatomy.parquet")
```

JSONL

```python
import json

with open("anatomy/anatomy.jsonl", encoding="utf-8") as f:
    for line in f:
        q = json.loads(line)  # each line holds one question record
        break  # stop after the first record
```

All subjects at once

```python
import pandas as pd

subjects = ["anatomy", "biochemistry", "cell-biology", "physiology", "pathology", "microbiology"]
full = pd.concat([pd.read_parquet(f"{s}/{s}.parquet") for s in subjects], ignore_index=True)
```

Key Properties
- 4 options per question, always
- Curated against 24 published item-writing flaws
- Bloom's taxonomy tagged
- Syllabus-aligned
- Provenance tracked
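
Two of these properties can be read straight off a row: the correct answer text comes from `options` and `correct_index`, and provenance comes from the `source` value. The following is a minimal sketch; `correct_option` and `synthetic_date` are hypothetical helper names, the rows are made up for illustration, and it assumes rows are plain dicts as decoded from the JSONL files.

```python
from datetime import date

SYNTHETIC_PREFIX = "synthetic_"


def correct_option(row):
    """Return the text of the correct answer for one row."""
    return row["options"][row["correct_index"]]


def synthetic_date(row):
    """Return the date embedded in a synthetic source tag, or None for "medmcqa"."""
    source = row["source"]
    if source.startswith(SYNTHETIC_PREFIX):
        return date.fromisoformat(source[len(SYNTHETIC_PREFIX):])
    return None


# Made-up rows for illustration only; not taken from the dataset.
rows = [
    {"source": "medmcqa",
     "options": ["Humerus", "Femur", "Tibia", "Fibula"],
     "correct_index": 1},
    {"source": "synthetic_2024-01-15",
     "options": ["Glycolysis", "Krebs cycle", "Urea cycle", "Beta oxidation"],
     "correct_index": 0},
]

for row in rows:
    origin = synthetic_date(row) or "medmcqa"
    print(correct_option(row), origin)
```
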
