Dataset Schema Reference

All Stravoris medical MCQ datasets share one schema: every subject, every row, and every file has the same columns, the same types, and the same semantics.

Columns

Column	Type	Nullable	Description
`id`	`string`	No	Human-readable unique identifier (e.g., ANA-UG-uplimnera-0042)
`subject`	`string`	No	Subject name
`source`	`string`	No	"medmcqa" or "synthetic_YYYY-MM-DD"
`syllabus_topic`	`string`	No	Normalized curriculum topic
`question`	`string`	No	Question stem and lead-in
`options`	`list[string]`	No	Exactly 4 answer options
`correct_index`	`integer`	No	Zero-based index of the correct option in `options` (0-3)
`explanation`	`string`	No	Reference explanation (pedagogical)
`training_explanation`	`string`	No	Training-optimized explanation (for SFT)
`blooms_level`	`string`	No	recall / comprehension / application / analysis
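
The table above can be turned into a per-row check. The following is a minimal sketch, not part of the dataset tooling: `validate_row`, `EXPECTED_COLUMNS`, `BLOOMS_LEVELS`, and `SOURCE_PATTERN` are hypothetical names, and the sketch assumes rows arrive as plain Python dicts (for example from the JSONL files) and that synthetic sources carry a literal date in `YYYY-MM-DD` form.

```python
import re

# Column names and Python types, taken from the Columns table.
EXPECTED_COLUMNS = {
    "id": str,
    "subject": str,
    "source": str,
    "syllabus_topic": str,
    "question": str,
    "options": list,
    "correct_index": int,
    "explanation": str,
    "training_explanation": str,
    "blooms_level": str,
}
BLOOMS_LEVELS = {"recall", "comprehension", "application", "analysis"}
# Assumption: "synthetic_YYYY-MM-DD" means a literal four-two-two digit date.
SOURCE_PATTERN = re.compile(r"medmcqa|synthetic_\d{4}-\d{2}-\d{2}")


def validate_row(row: dict) -> list:
    """Return a list of schema problems; an empty list means the row passed."""
    problems = []
    for name, expected_type in EXPECTED_COLUMNS.items():
        value = row.get(name)
        if value is None:
            problems.append(f"{name}: missing or null")
        elif isinstance(value, bool) or not isinstance(value, expected_type):
            # bool is a subclass of int, so reject it explicitly.
            problems.append(f"{name}: expected {expected_type.__name__}")
    for name in row:
        if name not in EXPECTED_COLUMNS:
            problems.append(f"{name}: unexpected column")
    if problems:
        return problems  # the value checks below assume correct types

    options = row["options"]
    if len(options) != 4 or not all(isinstance(o, str) for o in options):
        problems.append("options: expected exactly 4 strings")
    if not 0 <= row["correct_index"] <= 3:
        problems.append("correct_index: expected 0-3")
    if row["blooms_level"] not in BLOOMS_LEVELS:
        problems.append("blooms_level: unknown level")
    if not SOURCE_PATTERN.fullmatch(row["source"]):
        problems.append("source: expected medmcqa or synthetic_YYYY-MM-DD")
    return problems
```

Rows loaded through pandas may hold NumPy arrays and NumPy integers rather than `list` and `int`, so the type checks here would need loosening before being applied to a DataFrame.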

Loading Examples

pandas

import pandas as pd
df = pd.read_parquet("anatomy/anatomy.parquet")

HF datasets

from datasets import load_dataset
ds = load_dataset("stravoris/medical-mcq-dataset")

JSONL

import json

# Read only the first record to inspect its fields.
with open("anatomy/anatomy.jsonl", encoding="utf-8") as f:
    for line in f:
        q = json.loads(line)
        break
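
To go beyond the first record, the same loop can be wrapped in a generator, and `correct_index` can be resolved to the text of the correct option. This is a rough sketch: `iter_questions` and `correct_answer` are hypothetical helper names, and it assumes the JSONL keys match the column names in the schema table. The in-memory record is made up for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def iter_questions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-blank JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def correct_answer(record: dict) -> str:
    """Return the text of the option that correct_index points at."""
    return record["options"][record["correct_index"]]


# Illustrative in-memory data standing in for a real file; the record is made up.
demo = io.StringIO('{"options": ["A", "B", "C", "D"], "correct_index": 2}\n')
for q in iter_questions(demo):
    print(correct_answer(q))
```

With a real file, an open handle such as `open("anatomy/anatomy.jsonl", encoding="utf-8")` can be passed to `iter_questions` in place of `demo`.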

All subjects at once

import pandas as pd

subjects = [
    "medicine",
    "surgery",
    "anatomy",
    "biochemistry",
    "cell-biology",
    "microbiology",
    "pathology",
    "pharmacology",
    "physiology",
    "anaesthesia",
    "dermatology",
    "ent",
    "genetics",
    "obstetrics-gynaecology",
    "ophthalmology",
    "orthopaedics",
    "paediatrics",
    "psychiatry",
    "radiology",
    "venereology",
]
# Read each subject's Parquet file and stack them into one DataFrame.
full = pd.concat([pd.read_parquet(f"{s}/{s}.parquet") for s in subjects], ignore_index=True)

Key Properties

4 options per question, always
Curated against 24 published item-writing flaws
Bloom's taxonomy tagged
Syllabus-aligned
Provenance tracked