Curated Medical MCQ
Datasets for AI Training
58,000 schema-standardized, Bloom's-tagged, IWF-curated medical questions across 6 preclinical subjects. Built for SLM fine-tuning.
Built for Production Fine-Tuning
Six properties that distinguish this dataset from scraped MCQ dumps — engineered choices, not happy accidents.
Schema-standardized
One schema across all subjects. Write one data loader and use it everywhere.
Bloom's taxonomy tagged
Control reasoning-depth in your training mix: recall, comprehension, application, analysis.
IWF-curated
Every question checked against 24 published item-writing flaws. No 'all of the above', no negative stems, no length cues.
Syllabus-aligned
329 topics mapped to the NMC/CBME preclinical medical curriculum.
Provenance tracked
Source, curation decision, flaw tags, and generation metadata for every question.
Gap-filled
Synthetic questions target under-covered curriculum areas. Source field distinguishes human-authored from AI-generated.
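To make the IWF curation concrete, here is a minimal sketch of the kind of heuristic checks involved. This is NOT the dataset's actual curation code, and the 24 published item-writing flaws cover far more than these three; the function name and thresholds are illustrative only.

```python
# Illustrative only — a toy flaw checker for three of the classic
# item-writing flaws mentioned above. Not the dataset's real tooling.

def flag_item_writing_flaws(question: str, options: list[str],
                            correct_index: int) -> list[str]:
    """Return a list of flaw tags found in one MCQ."""
    flags = []
    # Flaw 1: 'all/none of the above' offered as an option.
    normalized = {opt.strip().lower() for opt in options}
    if normalized & {"all of the above", "none of the above"}:
        flags.append("all_of_the_above")
    # Flaw 2: negatively worded stem (NOT / EXCEPT).
    if any(word in question.upper().split() for word in ("NOT", "EXCEPT")):
        flags.append("negative_stem")
    # Flaw 3: length cue — correct option much longer than the rest.
    lengths = [len(opt) for opt in options]
    others = [n for i, n in enumerate(lengths) if i != correct_index]
    if lengths[correct_index] > 1.5 * max(others):
        flags.append("length_cue")
    return flags
```

A clean item returns an empty list; anything else routes to the rewrite step.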
6 Subjects, 58,000 Questions
Bloom's distribution per subject. Higher-order combines comprehension, application, and analysis.
| Subject | Questions | Topics | Recall | Higher-order |
|---|---|---|---|---|
| Anatomy | 12,057 | 56 | 67% | 33% |
| Biochemistry | 7,120 | 31 | 76% | 24% |
| Cell Biology & Histology | 7,720 | 54 | 49% | 51% |
| Physiology | 8,975 | 63 | 60% | 40% |
| Pathology | 13,032 | 75 | 64% | 36% |
| Microbiology | 9,092 | 50 | 73% | 27% |
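Because every row carries a `blooms_level` tag, you can control the recall vs. higher-order split of a training mix directly. A minimal sketch, assuming the level labels match the four listed above ("recall", "comprehension", "application", "analysis") — the function name is ours, not part of the dataset:

```python
import pandas as pd

def sample_training_mix(df: pd.DataFrame, recall_frac: float,
                        n: int, seed: int = 0) -> pd.DataFrame:
    """Draw n questions with a chosen recall vs. higher-order ratio."""
    recall = df[df["blooms_level"] == "recall"]
    higher = df[df["blooms_level"] != "recall"]  # comprehension/application/analysis
    n_recall = int(n * recall_frac)
    mix = pd.concat([
        recall.sample(n_recall, random_state=seed),
        higher.sample(n - n_recall, random_state=seed),
    ])
    return mix.sample(frac=1, random_state=seed)  # shuffle the combined mix
```

For example, `sample_training_mix(df, recall_frac=0.4, n=10_000)` yields a 40/60 recall vs. higher-order mix.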
Consistent Schema Across All Subjects
One column layout. One data loader. Drop in any subject and training pipelines keep working — no per-file shims, no special casing.
Full schema reference: id, subject, source, syllabus_topic, question, options (list[str], always 4), correct_index (0-3), explanation, blooms_level, plus provenance: raw_topic, model_version, generation_date, curation_decision, curation_flaws, curator_notes

import pandas as pd
df = pd.read_parquet("anatomy/anatomy.parquet")
# Same schema for every subject — write once, load all

Pricing
Flat pricing. Commercial license included. No seat counts, no training-run caps.
- One subject — parquet + JSONL + syllabus + README
- Commercial license for AI/ML training
- All 6 subjects — 57,996 questions
- Save $295 vs buying individually
- Includes schema reference + LICENSE
- Full bundle + redistribution rights
- Embed in commercial products and pipelines
Evaluate Before You Buy
Browse 600 sample questions (100 per subject) on Hugging Face — inspect the schema, check the quality, load them in Python.
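One way to "check the quality" programmatically is to assert the core schema invariants stated above on whatever split you load. A minimal sketch (column names from the schema reference; the helper name is ours):

```python
import pandas as pd

# Core columns from the published schema reference (provenance
# columns such as curation_flaws are not checked here).
EXPECTED_COLUMNS = {"id", "subject", "source", "syllabus_topic",
                    "question", "options", "correct_index",
                    "explanation", "blooms_level"}

def validate_schema(df: pd.DataFrame) -> None:
    """Raise AssertionError if the core schema invariants fail."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    assert not missing, f"missing columns: {missing}"
    # Every question carries exactly 4 options.
    assert df["options"].map(len).eq(4).all(), "expected 4 options per row"
    # correct_index points into that options list.
    assert df["correct_index"].between(0, 3).all(), "correct_index must be 0-3"
```

Run it once per subject file; if all six pass, one data loader really does cover the whole dataset.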
How We Built This
A 7-step pipeline. Every question routes through it.
- 01 Acquire: Source open medical MCQs and reference texts as the seed pool.
- 02 Classify: Map every question to a normalized syllabus topic and Bloom's level.
- 03 Gap Analysis: Identify under-covered topics and reasoning levels per subject.
- 04 Generate: Author synthetic questions targeting gaps, tagged with model + generation date.
- 05 Curate: Evaluate every question against 24 published item-writing flaws.
- 06 Fix: Rewrite flawed items and re-validate, preserving educational intent.
- 07 Finalize: Lock the schema, export parquet + JSONL, and ship a README per subject.
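The seven steps compose into a single data flow. The toy sketch below shows only that flow; every function body is a stand-in (the real curation tooling is not published, and names, topics, and stub logic here are illustrative):

```python
# Toy sketch of the 7-step pipeline. Each step is a stub that
# demonstrates the data flow, not the real implementation.

ALL_TOPICS = {"osteology", "neuroanatomy", "embryology"}  # hypothetical syllabus

def acquire(seed_pool):
    """01: take open MCQs and reference texts as the seed pool."""
    return list(seed_pool)

def classify(questions):
    """02: tag each item with a Bloom's level (stub: default to recall)."""
    return [{**q, "blooms_level": q.get("blooms_level", "recall")} for q in questions]

def gap_analysis(questions):
    """03: syllabus topics with no coverage yet."""
    return ALL_TOPICS - {q["syllabus_topic"] for q in questions}

def generate(gaps):
    """04: author synthetic items targeting the gaps."""
    return [{"syllabus_topic": t, "source": "synthetic",
             "blooms_level": "application"} for t in sorted(gaps)]

def curate(questions):
    """05: flag flawed items (stub: negative stems only)."""
    for q in questions:
        q["curation_flaws"] = ["negative_stem"] if "NOT" in q.get("question", "") else []
    return questions

def fix(questions):
    """06: rewrite flagged items (stub: clear the flag)."""
    for q in questions:
        q["curation_flaws"] = []
    return questions

def finalize(questions):
    """07: freeze the records for export."""
    return questions

def run_pipeline(seed_pool):
    qs = classify(acquire(seed_pool))               # steps 01-02
    qs += classify(generate(gap_analysis(qs)))      # steps 03-04
    return finalize(fix(curate(qs)))                # steps 05-07
```

Feeding one human-authored item through the chain yields that item plus gap-filling synthetic items, all flaw-checked, which mirrors how the Source field ends up distinguishing human-authored from AI-generated questions.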
Questions?
Contact hello@stravoris.com
More subjects coming — Pharmacology, Genetics, Immunology, Embryology.
