v0.4

A simulated hospital for medical AI agents.

Run an episode. Watch the agent decide. Score it.

10,000+ scenarios · 50+ archetypes · 100 difficulty tiers · 2,256 deterministic trajectories
// 01 simulation · episode ep_0fa72a32ec1d · fixture adv_cocaine_dissection · seed 42 · red_herring · live

// model spread · 7 models on adv_cocaine_dissection · scored against expert answer key · hypothetical

composite reward by model (chart): expert +85 · claude +48 · gpt-5 +40 · deepseek +34.5 · gemini +25 · llama +18 · qwen +6 · mistral −5
plotted over steps 01–28 (0–70 min) · ⚠ at step.13

// patient state

34M, chest pain × 2h, cocaine use 1h prior.

HR 118
BP 168/96 · pulse pressure widening
SpO₂ 96%

// alerts

⚠ widening pulse pressure · 72 → 120 (BP 188/68)
⚠ patient reports tearing back pain

// trajectory stream · env ↔ agent

// orders & results · labs & imaging · medications · disposition

// 02 scoring · trajectory ep_0fa72a32ec1d · 9 dimensions

// per-dimension decomposition

survival +30.0
disposition 0.0
workup +8.0
red-flag −12.5
communication +4.0
cost +5.0
harm 0.0
efficiency +2.5
documentation +0.5
composite reward
+34.5
deterministic · seed 42 · reproducible
outcome missed_aad · harm high
// 03 counterfactual replay · same fixture · same seed · alt action @ step.13

// actual run

agent's choice

reassure · no imaging

outcome · missed_aad

+34.5
composite reward
vs

// counterfactual

if instead

imaging · CTA chest

outcome · caught_aad

+85
composite reward
Δ +50.5 · the verifiable cost of the missed call. Same patient, replayable, scored. GRPO signal.
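A minimal sketch of the replay mechanic, assuming a gym-style interface. The medgym objects, field names, and action labels below are illustrative, not the released API.

# Hypothetical replay sketch; API and field names are illustrative.
def counterfactual_delta(env, trajectory, step_idx, alt_action):
    """Replay a recorded trajectory with one action swapped, same seed."""
    env.reset(seed=trajectory.seed)             # deterministic: same patient
    total = 0.0
    for i, action in enumerate(trajectory.actions):
        if i == step_idx:
            action = alt_action                 # e.g. "order_cta_chest" at step 13
        _, reward, done, _ = env.step(action)   # env re-scores every action
        total += reward
        if done:
            break
    return total - trajectory.composite         # Δ: the scalar GRPO advantage

Both branches share every token of context up to step 13, so the delta isolates a single decision.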
// 04 training data · trajectory ep_0fa72a32ec1d · corpus wave1b

The trajectory becomes training data.

One encounter, three corpora: verifiable rewards, preference pairs, expert demonstrations.

// rlvr · 28 tuples

Verifiable rewards

Tuples scored against the answer key.

corpus/wave1b/rlvr/ep_0fa72a32ec1d.jsonl
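For illustration, one plausible record from that file, read with the standard library. The field names and values are an assumption, not the released schema.

import json

# Hypothetical record shape; field names and values are illustrative.
with open("corpus/wave1b/rlvr/ep_0fa72a32ec1d.jsonl") as f:
    rows = [json.loads(line) for line in f]

# One row might look like:
# {"episode": "ep_0fa72a32ec1d", "step": 1, "seed": 42,
#  "state": "...", "action": "order_ecg_troponin", "reward": 2.0}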
// grpo · 3 pairs

Group preference pairs

Chosen vs rejected at each decision, same seed.

corpus/wave1b/grpo/ep_0fa72a32ec1d.jsonl
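A sketch of what one pair could hold; field names and action labels are assumptions, not the released schema.

# Hypothetical preference pair; field names and labels are illustrative.
pair = {
    "episode": "ep_0fa72a32ec1d",
    "step": 13,
    "seed": 42,                                     # same seed, same patient state
    "chosen": {"action": "order_cta_chest"},        # the branch that catches the AAD
    "rejected": {"action": "reassure_no_imaging"},  # the branch that misses it
}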
// sft · 1 correction

Expert demonstration

Physician's correction at the critical step.

corpus/wave1b/sft/ep_0fa72a32ec1d.jsonl
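A sketch of the correction record; the message format is an assumption, and the content follows the counterfactual above.

# Hypothetical correction record; format is illustrative.
correction = {
    "episode": "ep_0fa72a32ec1d",
    "step": 13,
    "prompt": "...",   # full encounter context up to the critical step
    "completion": (
        "Order CTA chest. Tearing back pain with a widening pulse pressure "
        "after cocaine use is aortic dissection until proven otherwise."
    ),
    "source": "physician_review",   # physician-authored correction
}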

Real cohorts. Expert answer keys.

Real cohorts, graded by practicing clinicians. Every patient, every score traceable to a verified clinical source.

// fixture review template · wave 1b sample
adv_cocaine_dissection · lead: Board-certified, Emergency Medicine · co-sign: Board-certified, Cardiology · target ≥ 0.90 · sign-off: wave 1b
adv_aaa_30min · lead: Board-certified, Emergency Medicine · co-sign: Board-certified, Vascular Surgery · target ≥ 0.90 · sign-off: wave 1b
adv_pediatric_dka · lead: Board-certified, Emergency Medicine · co-sign: Board-certified, Pediatrics · target ≥ 0.90 · sign-off: wave 1b
// partner network
14 target partner sites · 4 public corpora
// target IAA · wave 1b
≥ 0.90 target inter-rater agreement · rising every release (target curve)
// specialty panel · 9 specialties · ~30 reviewers
EM Emergency Medicine · Cards Cardiology · Neuro Neurology · Peds Pediatrics · Vasc Vascular Surgery · SCC Surgical Critical Care · IM Internal Medicine · OB Obstetrics · Tox Toxicology

HIPAA-compliant de-identification · reviewer pipeline audited quarterly · raw review logs available under partner DUA

Why MedGym exists.

Medical AI has had benchmarks for years. None of them test decision-making in a clinical setting. None of them simulate how medicine is actually practiced.

// medqa · q1247
35M with chest pain, cocaine 1h prior. Most likely dx:
A GERD
B ACS
C Aortic dissection
D Pulmonary embolism
1 question · 1 correct answer · static
vs.
// medgym · ep_0fa72a32ec1d
└─ start
   ├─ order ECG, troponin
   │   └─ trop neg, sinus tach
   ├─ interview HPI
   │   └─ "cocaine 1h prior"
   ├─ admit chest_pain_obs
   └─ ✗ CTA chest · never ordered
28 actions · stateful · scored end-to-end
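
In code, the difference is an episode loop, not a single forward pass. A sketch, assuming a gym-style interface; medgym.make, env.step, and the agent object are illustrative, not the released API.

# Hypothetical episode loop; the medgym API shown is illustrative.
env = medgym.make("adv_cocaine_dissection", seed=42)
obs = env.reset()                    # 34M, chest pain × 2h, cocaine 1h prior

done, total = False, 0.0
while not done:                      # up to 28 actions in this episode
    action = agent.act(obs)          # order, interview, treat, or disposition
    obs, reward, done, info = env.step(action)
    total += reward                  # scored end-to-end, not per question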

// 03.1 · the gap

Benchmarks measure recall, not deployment.

USMLE-style multiple choice tests pattern matching on isolated questions. Real encounters are sequential. Every call shapes what comes next. Models that ace medical Q&A still chase the wrong clues and miss the sickest patients.

MedGym scores the full episode, not the final answer. Order timing, workup completeness, disposition appropriateness, and the calls the model didn't make are all part of the grade.

// reward(trajectory)
+30.0 survival
+8.0 workup completeness
+5.0 cost efficiency
+4.0 communication
−12.5 red-flag missed
+34.5 composite
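
In this condensed view the composite is the plain sum of the dimension scores. A sketch that makes the arithmetic explicit; the dimension keys are illustrative.

# Condensed decomposition from above; composite = sum of dimensions.
scores = {
    "survival": 30.0,
    "workup_completeness": 8.0,
    "cost_efficiency": 5.0,
    "communication": 4.0,
    "red_flag_missed": -12.5,
}
composite = sum(scores.values())   # +34.5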

// 03.2 · the primitive

Verifiable rewards on full trajectories.

Coding agents broke through when the reward signal got better: PR merged, tests pass. MedGym brings that primitive to medicine. A stateful patient, programmatic scoring across nine outcome dimensions, deterministic replay. The reward is just math.

// 03.3 · the output

Training data, at the scale of medicine.

Every episode yields RLVR tuples, GRPO preference pairs, and SFT corrections. Open source, replayable, scored against physician-authored answer keys. The corpus frontier labs need to drive medical agents toward deployment.

What we have. What's coming.

Environment is live. Public benchmark and whitepaper land in 2026.

// editor's note · MedGym is pre-launch. The environment, harness, and answer keys are being staged for release. If you're training or evaluating a medical agent, bring us your model. c.c. · 2026-05-11