MedGym — A simulated hospital for medical AI agents

A simulated hospital for medical AI agents.

Run an episode. Watch the agent decide. Score it.

For frontier labs, medical AI startups, and clinician partners. or hi@medgym.ai

Opening your email client — finish sending to complete.

// 01 simulation episode ep_0fa72a32ec1d fixture adv_cocaine_dissection seed 42 red_herring live

// model spread · 7 models on adv_cocaine_dissection · scored against expert answer key · hypothetical

// patient state

34M, chest pain × 2h, cocaine use 1h prior.

HR 118↑ from 118

BP 168/96⚠ PP widening

SpO₂ 96↓ from 96

// alerts

⚠ widening pulse pressure (120 → 188/68)

⚠ patient reports tearing back pain

// trajectory stream · env ↔ agent

// orders & results

labs & imaging

medications

disposition

// 02 scoring trajectory ep_0fa72a32ec1d dimensions 9 awaiting

// per-dimension decomposition

survival +30.0

disposition 0.0

workup +8.0

red-flag −12.5

communication +4.0

cost +5.0

harm 0.0

efficiency +2.5

documentation +0.5

composite reward

+0.0

deterministic · seed 42 · reproducible

outcome · missed_aad harm · high

// 03 counterfactual replay same fixture · same seed · alt action @ step.13 awaiting

// actual run

agent's choice

reassure · no imaging

outcome · missed_aad

+34.5

composite reward

// counterfactual

if instead

imaging · CTA chest

outcome · caught_aad

+0.0

composite reward

Δ +0.0 the verifiable cost of the missed call. Same patient, replayable, scored. GRPO signal.

// 04 training data trajectory ep_0fa72a32ec1d corpus wave1b awaiting

The trajectory becomes training data.

One encounter → three corpora: verifiable rewards, preference pairs, expert demonstrations.

// rlvr 28 tuples

Verifiable rewards

Tuples scored against the answer key.

corpus/wave1b/rlvr/ep_0fa72a32ec1d.jsonl

// grpo 3 pairs

Group preference pairs

Chosen vs rejected at each decision, same seed.

corpus/wave1b/grpo/ep_0fa72a32ec1d.jsonl

// sft 1 correction

Expert demonstration

Physician's correction at the critical step.

corpus/wave1b/sft/ep_0fa72a32ec1d.jsonl

Real cohorts. Expert answer keys.

Real cohorts, graded by practicing clinicians. Every patient, every score traceable to a verified clinical source.

// fixture review template wave 1b sample

adv_cocaine_dissection target sign-off

Lead Board-certified · Emergency Medicine Co-sign Board-certified · Cardiology

≥ 0.90 wave 1b

adv_aaa_30min target sign-off

Lead Board-certified · Emergency Medicine Co-sign Board-certified · Vascular Surgery

≥ 0.90 wave 1b

adv_pediatric_dka target sign-off

Lead Board-certified · Emergency Medicine Co-sign Board-certified · Pediatrics

≥ 0.90 wave 1b

// partner network

target partner sites · 4 public corpora

// target IAA wave 1b

≥ 0.90

target inter-rater agreement

↗ rising every release · target curve

// specialty panel 9 specialties · ~30 reviewers

Cards

Neuro

Peds

Vasc

SCC

Tox

EMEmergency Medicine CardsCardiology NeuroNeurology PedsPediatrics VascVascular Surgery SCCSurgical Critical Care IMInternal Medicine OBObstetrics ToxToxicology

HIPAA-compliant de-identification reviewer pipeline audited quarterly raw review logs available under partner DUA

Why MedGym exists.

Medical AI has had benchmarks for years. None of them test decision-making in a clinical setting. None of them simulate how medicine is actually practiced.

// medqa · q1247

35M with chest pain, cocaine 1h prior. Most likely dx:

A GERD

B ACS

C Aortic dissection

D Pulmonary embolism

1 question · 1 correct answer · static

vs.

// medgym · ep_0fa72a32ec1d

└─ start
   ├─ order ECG, troponin
   │   └─ trop neg, sinus tach
   ├─ interview HPI
   │   └─ "cocaine 1h prior"
   ├─ admit chest_pain_obs
   └─ ✗ CTA chest · never ordered

28 actions · stateful · scored end-to-end

// 03.1 · the gap

Benchmarks measure recall, not deployment.

USMLE-style multiple choice tests pattern matching on isolated questions. Real encounters are sequential. Every call shapes what comes next. Models that ace medical Q&A still chase the wrong clues and miss the sickest patients.

MedGym scores the full episode, not the final answer. Order timing, workup completeness, disposition appropriateness, and the calls the model didn't make are all part of the grade.

// reward(trajectory)

+30.0 survival

+8.0 workup completeness

+5.0 cost efficiency

+4.0 communication

−12.5 red-flag missed

+34.5 composite

// 03.2 · the primitive

Verifiable rewards on full trajectories.

Coding agents broke through when the reward signal got better: PR merged, tests pass. MedGym brings that primitive to medicine. A stateful patient, programmatic scoring across nine outcome dimensions, deterministic replay. The reward is just math.

// 03.3 · the output

Training data, at the scale of medicine.

Every episode yields RLVR tuples, GRPO preference pairs, and SFT corrections. Open source, replayable, scored against physician-authored answer keys. The corpus frontier labs need to drive medical agents toward deployment.

What we have. What's coming.

Environment is live. Public benchmark and whitepaper land in 2026.

now

A full emergency department, simulated. An encounter library across all acuity tiers: red herrings, atypical presentations, pediatric and geriatric edges. Realistic vitals, labs, demographics. Hidden diagnoses drawn from real ED misses. Apache 2.0.

v0.4 live

q3 2026

Public leaderboard. Frontier and open-weights models evaluated end to end. Composite reward, per-dimension breakdowns, per-model audit reports.

scheduled

late 2026

Methodology whitepaper. arXiv preprint on environment design, the rubric, and reproducibility. Community-contributed encounters open with the paper.

scheduled

// editor's note MedGym is pre-launch. The environment, harness, and answer keys are being staged for release. If you're training or evaluating a medical agent, bring us your model. c.c. · 2026-05-11