Stravoris
Stravoris Datasets · Preclinical Bundle

Curated Medical MCQ Datasets for AI Training

57,996 schema-standardized, Bloom's-tagged medical multiple-choice questions across 6 preclinical subjects, each curated against item-writing flaws (IWF). Built for fine-tuning small language models (SLMs).

58k questions · 6 subjects · 329 topics · 24 flaws checked

Built for Production Fine-Tuning

Six properties that distinguish this dataset from scraped MCQ dumps — engineered choices, not happy accidents.

01

Schema-standardized

One schema across all subjects. Write one data loader and use it everywhere.

02

Bloom's taxonomy tagged

Control reasoning depth in your training mix: recall, comprehension, application, analysis.
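As a rough illustration of controlling that mix, the sketch below draws a fixed share of questions per Bloom's level. It assumes rows are plain dicts (for example, parsed JSONL lines or `df.to_dict("records")`) and that `blooms_level` holds lowercase labels such as "recall"; the exact label strings and the `sample_mix` helper are assumptions, not part of the dataset.

```python
import random
from collections import defaultdict

def sample_mix(rows, weights, n, seed=0):
    """Draw about n rows so each Bloom's level gets its target share.

    rows    -- list of dicts with a "blooms_level" key
    weights -- {level: fraction of n}; level labels are assumed, check your data
    """
    rng = random.Random(seed)  # seeded so the mix is reproducible
    by_level = defaultdict(list)
    for row in rows:
        by_level[row["blooms_level"]].append(row)

    picked = []
    for level, share in weights.items():
        pool = by_level.get(level, [])
        k = min(len(pool), round(n * share))  # never ask for more than exists
        picked.extend(rng.sample(pool, k))
    rng.shuffle(picked)
    return picked

# Toy rows standing in for the real data.
rows = (
    [{"id": f"r{i}", "blooms_level": "recall"} for i in range(70)]
    + [{"id": f"a{i}", "blooms_level": "application"} for i in range(30)]
)
mix = sample_mix(rows, {"recall": 0.5, "application": 0.5}, n=40)
```

If a level has fewer rows than its target, the helper takes everything available and returns fewer than `n` rows rather than sampling with replacement.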

03

IWF-curated

Every question checked against 24 published item-writing flaws. No 'all of the above', no negative stems, no length cues.
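To give a feel for what such checks look like, here is a minimal sketch of heuristics for the three flaws named above. It is not the curation pipeline used for this dataset and covers none of the other flaws on the 24-item list; the function name, flaw labels, regex, and the 1.5x length threshold are all illustrative assumptions.

```python
import re

# Illustrative pattern only: flags stems that hinge on "not" or "except".
NEGATIVE_STEM = re.compile(r"\b(not|except)\b", re.IGNORECASE)

def quick_flaw_scan(question, options, correct_index, length_ratio=1.5):
    """Return a list of hypothetical flaw labels for one MCQ."""
    flaws = []

    # 1. "All of the above" used as an option.
    normalized = [opt.strip().lower().rstrip(".") for opt in options]
    if "all of the above" in normalized:
        flaws.append("all_of_the_above")

    # 2. Negatively worded stem.
    if NEGATIVE_STEM.search(question):
        flaws.append("negative_stem")

    # 3. Length cue: the correct option is much longer than every distractor.
    correct_len = len(options[correct_index])
    longest_distractor = max(
        len(opt) for i, opt in enumerate(options) if i != correct_index
    )
    if correct_len > length_ratio * longest_distractor:
        flaws.append("length_cue")

    return flaws

flagged = quick_flaw_scan(
    "Which of the following is NOT a feature of acute inflammation?",
    ["Redness", "Swelling", "Fibrosis", "All of the above"],
    correct_index=2,
)
```

Heuristics like these produce false positives (a stem can contain "not" without being negatively framed), so they are better suited to triage than to automatic rejection.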

04

Syllabus-aligned

329 topics mapped to the NMC/CBME preclinical medical curriculum.

05

Provenance tracked

Source, curation decision, flaw tags, and generation metadata for every question.

06

Gap-filled

Synthetic questions target under-covered curriculum areas. The source field distinguishes human-authored from AI-generated questions.
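One way to use the source field is to cap the share of AI-generated questions in a training set. The sketch below assumes rows are plain dicts and that the synthetic rows carry a label such as "synthetic"; the actual values stored in `source` are not stated here, so the labels and the `cap_synthetic` helper are hypothetical.

```python
from collections import Counter

def cap_synthetic(rows, synthetic_labels, max_share):
    """Keep every human-authored row plus as many synthetic rows as fit
    while synthetic rows stay at or below max_share of the result."""
    if not 0 <= max_share < 1:
        raise ValueError("max_share must be in [0, 1)")
    human = [r for r in rows if r["source"] not in synthetic_labels]
    synthetic = [r for r in rows if r["source"] in synthetic_labels]
    # Solve s / (h + s) <= max_share for s, then round down.
    limit = int(max_share * len(human) / (1 - max_share))
    return human + synthetic[:limit]

# Toy rows; "human" and "synthetic" are placeholder labels.
rows = (
    [{"id": f"h{i}", "source": "human"} for i in range(9)]
    + [{"id": f"s{i}", "source": "synthetic"} for i in range(8)]
)
capped = cap_synthetic(rows, synthetic_labels={"synthetic"}, max_share=0.25)
breakdown = Counter(r["source"] for r in capped)
```

Inspect the distinct values of `source` in your copy first, then pass the matching labels in `synthetic_labels`.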

6 Subjects, 57,996 Questions

Bloom's distribution per subject. Higher-order combines comprehension, application, and analysis.

Subject | Questions | Topics | Recall | Higher-order
Anatomy | 12,057 | 56 | 67% | 33%
Biochemistry | 7,120 | 31 | 76% | 24%
Cell Biology & Histology | 7,720 | 54 | 49% | 51%
Physiology | 8,975 | 63 | 60% | 40%
Pathology | 13,032 | 75 | 64% | 36%
Microbiology | 9,092 | 50 | 73% | 27%
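The per-subject figures above can be cross-checked against the headline totals, and combined into an approximate bundle-wide recall share. The snippet below is a small arithmetic sketch using only the numbers in the table; because the percentages are rounded, the overall share is an estimate.

```python
# (subject, questions, topics, recall %) copied from the table above
TABLE = [
    ("Anatomy", 12057, 56, 67),
    ("Biochemistry", 7120, 31, 76),
    ("Cell Biology & Histology", 7720, 54, 49),
    ("Physiology", 8975, 63, 60),
    ("Pathology", 13032, 75, 64),
    ("Microbiology", 9092, 50, 73),
]

total_questions = sum(q for _, q, _, _ in TABLE)
total_topics = sum(t for _, _, t, _ in TABLE)

# Weight each subject's recall share by its question count.
recall_questions = sum(q * r / 100 for _, q, _, r in TABLE)
overall_recall = recall_questions / total_questions

print(total_questions, total_topics, round(overall_recall * 100, 1))
# prints: 57996 329 64.9
```

So roughly 65% of the bundle is recall-level and about 35% is higher-order, with Cell Biology & Histology the only subject where higher-order questions are the majority.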

Consistent Schema Across All Subjects

One column layout. One data loader. Drop in any subject and training pipelines keep working — no per-file shims, no special casing.

Full schema reference
Columns
id, subject, source, syllabus_topic, question, options (list[str], always 4), correct_index (0-3), explanation, blooms_level
Provenance columns: raw_topic, model_version, generation_date, curation_decision, curation_flaws, curator_notes
Load it
import pandas as pd
df = pd.read_parquet("anatomy/anatomy.parquet")
# Same schema for every subject — write once, load all
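Before training, it can be worth confirming that loaded rows match the documented layout. The following is a minimal sketch of such a check over plain dict rows (for example, `df.to_dict("records")` or parsed JSONL lines); `validate_row` and the sample row are hypothetical, and the check covers only the core columns, four options, and a `correct_index` of 0 to 3.

```python
REQUIRED = (
    "id", "subject", "source", "syllabus_topic", "question",
    "options", "correct_index", "explanation", "blooms_level",
)

def validate_row(row):
    """Return a list of problems found in one row (empty list means OK)."""
    missing = [col for col in REQUIRED if col not in row]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if len(row["options"]) != 4:
        problems.append(f"expected 4 options, got {len(row['options'])}")
    if not 0 <= row["correct_index"] <= 3:
        problems.append(f"correct_index out of range: {row['correct_index']}")
    return problems

# Made-up row for illustration; not taken from the dataset.
good = {
    "id": "example-0001",
    "subject": "Anatomy",
    "source": "human",
    "syllabus_topic": "Upper limb",
    "question": "Which nerve supplies the deltoid muscle?",
    "options": ["Axillary nerve", "Radial nerve", "Median nerve", "Ulnar nerve"],
    "correct_index": 0,
    "explanation": "The axillary nerve innervates the deltoid.",
    "blooms_level": "recall",
}
bad = dict(good, options=good["options"][:3], correct_index=4)

print(validate_row(good))  # []
print(len(validate_row(bad)))  # 2
```

Because every subject shares the schema, the same check can be run unchanged over each subject's file.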

Pricing

Flat pricing. Commercial license included. No seat counts, no training-run caps.

View license terms →
Single Subject
$149
  • One subject — parquet + JSONL + syllabus + README
  • Commercial license for AI/ML training
Recommended
Preclinical Bundle
$599
  • All 6 subjects — 57,996 questions
  • Save $295 vs buying individually
  • Includes schema reference + LICENSE
Enterprise
$2,999
  • Full bundle + redistribution rights
  • Embed in commercial products and pipelines

Evaluate Before You Buy

Browse 600 sample questions (100 per subject) on Hugging Face — inspect the schema, check the quality, load them in Python.


How We Built This

A 7-step pipeline. Every question routes through it.

  1. Acquire

    Source open medical MCQs and reference texts as the seed pool.

  2. Classify

    Map every question to a normalized syllabus topic and Bloom's level.

  3. Gap Analysis

    Identify under-covered topics and reasoning levels per subject.

  4. Generate

    Author synthetic questions targeting gaps, tagged with model + generation date.

  5. Curate

    Evaluate every question against 24 published item-writing flaws.

  6. Fix

    Rewrite flawed items and re-validate, preserving educational intent.

  7. Finalize

    Lock the schema, export parquet + JSONL, and ship a README per subject.
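The gap-analysis step can be pictured as a coverage count over topic and Bloom's level. The sketch below is one simple way to do that on rows that use the published `syllabus_topic` and `blooms_level` columns; it is an assumption about how such a step might work, not a description of the actual pipeline, and `find_gaps`, the threshold, and the sample topics are hypothetical.

```python
from collections import Counter
from itertools import product

def find_gaps(rows, topics, levels, min_count):
    """Return {(topic, level): count} for every cell below min_count.

    topics and levels list the cells that should be covered, so cells
    with zero questions are reported too.
    """
    counts = Counter((r["syllabus_topic"], r["blooms_level"]) for r in rows)
    return {
        cell: counts[cell]
        for cell in product(topics, levels)
        if counts[cell] < min_count
    }

# Toy rows; topic names and level labels are placeholders.
rows = (
    [{"syllabus_topic": "Brachial plexus", "blooms_level": "recall"}] * 3
    + [{"syllabus_topic": "Brachial plexus", "blooms_level": "application"}]
    + [{"syllabus_topic": "Heart", "blooms_level": "recall"}]
)
gaps = find_gaps(
    rows,
    topics=["Brachial plexus", "Heart"],
    levels=["recall", "application"],
    min_count=2,
)
```

Each reported cell is a candidate target for the generation step: a topic and reasoning level where new questions would improve coverage.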

Questions?

Contact hello@stravoris.com

More subjects coming — Pharmacology, Genetics, Immunology, Embryology.