Curated Medical MCQ
Datasets for AI Training
58,000 schema-standardized, Bloom's-tagged, IWF-curated medical questions across 6 preclinical subjects. Built for SLM fine-tuning.
Built for Production Fine-Tuning
Six properties that distinguish this dataset from scraped MCQ dumps — engineered choices, not happy accidents.
Schema-standardized
One schema across all subjects. Write one data loader and use it everywhere.
Bloom's taxonomy tagged
Control reasoning-depth in your training mix: recall, comprehension, application, analysis.
IWF-curated
Every question checked against 24 published item-writing flaws. No 'all of the above', no negative stems, no length cues.
Syllabus-aligned
329 topics mapped to the NMC/CBME preclinical medical curriculum.
Provenance tracked
Source, curation decision, flaw tags, and generation metadata for every question.
Gap-filled
Synthetic questions target under-covered curriculum areas. Source field distinguishes human-authored from AI-generated.
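To make the IWF curation concrete, here is a minimal sketch of the kind of heuristic checks involved. This is NOT the dataset's actual curation code, and the 24 published item-writing flaws cover far more than these three; the function name and thresholds are illustrative only.

```python
# Illustrative only — a toy flaw checker for three of the classic
# item-writing flaws mentioned above. Not the dataset's real tooling.

def flag_item_writing_flaws(question: str, options: list[str],
                            correct_index: int) -> list[str]:
    """Return a list of flaw tags found in one MCQ."""
    flags = []
    # Flaw 1: 'all/none of the above' offered as an option.
    normalized = {opt.strip().lower() for opt in options}
    if normalized & {"all of the above", "none of the above"}:
        flags.append("all_of_the_above")
    # Flaw 2: negatively worded stem (NOT / EXCEPT).
    if any(word in question.upper().split() for word in ("NOT", "EXCEPT")):
        flags.append("negative_stem")
    # Flaw 3: length cue — correct option much longer than the rest.
    lengths = [len(opt) for opt in options]
    others = [n for i, n in enumerate(lengths) if i != correct_index]
    if lengths[correct_index] > 1.5 * max(others):
        flags.append("length_cue")
    return flags
```

A clean item returns an empty list; anything else routes to the rewrite step.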
6 Subjects, 58,000 Questions
Bloom's distribution per subject. Higher-order combines comprehension, application, and analysis.
| Subject | Questions | Topics | Recall | Higher-order |
|---|---|---|---|---|
| Anatomy | 12,057 | 56 | 67% | 33% |
| Biochemistry | 7,120 | 31 | 76% | 24% |
| Cell Biology & Histology | 7,720 | 54 | 49% | 51% |
| Physiology | 8,975 | 63 | 60% | 40% |
| Pathology | 13,032 | 75 | 64% | 36% |
| Microbiology | 9,092 | 50 | 73% | 27% |
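Because every row carries a `blooms_level` tag, you can control the recall vs. higher-order split of a training mix directly. A minimal sketch, assuming the level labels match the four listed above ("recall", "comprehension", "application", "analysis") — the function name is ours, not part of the dataset:

```python
import pandas as pd

def sample_training_mix(df: pd.DataFrame, recall_frac: float,
                        n: int, seed: int = 0) -> pd.DataFrame:
    """Draw n questions with a chosen recall vs. higher-order ratio."""
    recall = df[df["blooms_level"] == "recall"]
    higher = df[df["blooms_level"] != "recall"]  # comprehension/application/analysis
    n_recall = int(n * recall_frac)
    mix = pd.concat([
        recall.sample(n_recall, random_state=seed),
        higher.sample(n - n_recall, random_state=seed),
    ])
    return mix.sample(frac=1, random_state=seed)  # shuffle the combined mix
```

For example, `sample_training_mix(df, recall_frac=0.4, n=10_000)` yields a 40/60 recall vs. higher-order mix.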
Consistent Schema Across All Subjects
One column layout. One data loader. Drop in any subject and training pipelines keep working — no per-file shims, no special casing.
Full schema reference: id, subject, source, syllabus_topic, question, options (list[str], always 4), correct_index (0-3), explanation, blooms_level, plus provenance: raw_topic, model_version, generation_date, curation_decision, curation_flaws, curator_notes

import pandas as pd
df = pd.read_parquet("anatomy/anatomy.parquet")
# Same schema for every subject — write once, load all

Pricing
Flat pricing. Commercial license included. No seat counts, no training-run caps.
- One subject — parquet + JSONL + syllabus + README
- Commercial license for AI/ML training
- All 6 subjects — 57,996 questions
- Save $295 vs buying individually
- Includes schema reference + LICENSE
- Full bundle + redistribution rights
- Embed in commercial products and pipelines
Evaluate Before You Buy
Browse 600 sample questions (100 per subject) on Hugging Face — inspect the schema, check the quality, load them in Python.
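One way to "check the quality" programmatically is to assert the core schema invariants stated above on whatever split you load. A minimal sketch (column names from the schema reference; the helper name is ours):

```python
import pandas as pd

# Core columns from the published schema reference (provenance
# columns such as curation_flaws are not checked here).
EXPECTED_COLUMNS = {"id", "subject", "source", "syllabus_topic",
                    "question", "options", "correct_index",
                    "explanation", "blooms_level"}

def validate_schema(df: pd.DataFrame) -> None:
    """Raise AssertionError if the core schema invariants fail."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    assert not missing, f"missing columns: {missing}"
    # Every question carries exactly 4 options.
    assert df["options"].map(len).eq(4).all(), "expected 4 options per row"
    # correct_index points into that options list.
    assert df["correct_index"].between(0, 3).all(), "correct_index must be 0-3"
```

Run it once per subject file; if all six pass, one data loader really does cover the whole dataset.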
How We Built This
A 7-step pipeline. Every question routes through it.
- 01 Acquire: Source open medical MCQs and reference texts as the seed pool.
- 02 Classify: Map every question to a normalized syllabus topic and Bloom's level.
- 03 Gap Analysis: Identify under-covered topics and reasoning levels per subject.
- 04 Generate: Author synthetic questions targeting gaps, tagged with model + generation date.
- 05 Curate: Evaluate every question against 24 published item-writing flaws.
- 06 Fix: Rewrite flawed items and re-validate, preserving educational intent.
- 07 Finalize: Lock the schema, export parquet + JSONL, and ship a README per subject.
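The seven steps compose into a single data flow. The toy sketch below shows only that flow; every function body is a stand-in (the real curation tooling is not published, and names, topics, and stub logic here are illustrative):

```python
# Toy sketch of the 7-step pipeline. Each step is a stub that
# demonstrates the data flow, not the real implementation.

ALL_TOPICS = {"osteology", "neuroanatomy", "embryology"}  # hypothetical syllabus

def acquire(seed_pool):
    """01: take open MCQs and reference texts as the seed pool."""
    return list(seed_pool)

def classify(questions):
    """02: tag each item with a Bloom's level (stub: default to recall)."""
    return [{**q, "blooms_level": q.get("blooms_level", "recall")} for q in questions]

def gap_analysis(questions):
    """03: syllabus topics with no coverage yet."""
    return ALL_TOPICS - {q["syllabus_topic"] for q in questions}

def generate(gaps):
    """04: author synthetic items targeting the gaps."""
    return [{"syllabus_topic": t, "source": "synthetic",
             "blooms_level": "application"} for t in sorted(gaps)]

def curate(questions):
    """05: flag flawed items (stub: negative stems only)."""
    for q in questions:
        q["curation_flaws"] = ["negative_stem"] if "NOT" in q.get("question", "") else []
    return questions

def fix(questions):
    """06: rewrite flagged items (stub: clear the flag)."""
    for q in questions:
        q["curation_flaws"] = []
    return questions

def finalize(questions):
    """07: freeze the records for export."""
    return questions

def run_pipeline(seed_pool):
    qs = classify(acquire(seed_pool))               # steps 01-02
    qs += classify(generate(gap_analysis(qs)))      # steps 03-04
    return finalize(fix(curate(qs)))                # steps 05-07
```

Feeding one human-authored item through the chain yields that item plus gap-filling synthetic items, all flaw-checked, which mirrors how the Source field ends up distinguishing human-authored from AI-generated questions.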
Questions?
Contact hello@stravoris.com
More subjects coming — Pharmacology, Genetics, Immunology, Embryology.
