Your HealthBench Scores Depend on Which Judge You Use
An audit of 29,511 physician-labeled records reveals massive cross-judge disagreement at the category level
TL;DR
- Two LLM judges (gpt-4o-mini, Claude Haiku 4.5) disagree by up to 73 percentage points on specific HealthBench criteria categories, and sometimes in opposite directions.
- Both judges are overconfident relative to physician ground truth: gpt-4o-mini by 24.5 pp, Haiku by 13.0 pp. Standard metrics (F1, accuracy) can still look acceptable while calibration and category-level failures remain.
- After aggregate calibration, both converge to the same physician-aligned mean (~0.671). With physician labels on just 5% of records (~1,400 labels), that aggregate estimate is recoverable, but broken category-level subscores remain broken.
Why this matters
HealthBench (OpenAI, May 2025) is one of the strongest open benchmarks for medical AI conversation evaluation. It addresses limitations of saturated multiple-choice benchmarks (MedQA, USMLE) by using physician-authored rubrics and criterion-level scoring. It's been used by OpenAI for model launches, by startups like DR. INFO and Intelligent Internet for competitive claims, and by the UK AI Safety Institute's Inspect framework for standardized evaluation.
Most public HealthBench score reports we found use raw judge outputs, and many do not foreground the judge model next to the headline score. We did not find downstream score reports that calibrate benchmark means back to physician labels.
HealthBench also ships something rare: a meta-evaluation set with physician consensus labels for individual criteria. We evaluated all 29,511 criterion-level records in the inspect_evals implementation. This lets us directly audit the judge, and we haven't found a prior systematic cross-judge calibration audit of the released meta-eval data.
What we ran
- Full judge audit on all 29,511 meta-eval records with two judges: gpt-4o-mini (the inspect_evals default) and claude-haiku-4-5-20251001
- We modified the judge prompt to also request a 0–1 confidence score alongside the binary judgment (temperature 0.0). A prompt A/B check confirmed this does not materially change aggregate results (see the Robustness section below).
- Oracle-coverage sweeps (100/50/25/10/5%) using CJE[1] (Causal Judge Evaluation), with deterministic subsampling (seed base 42, one seed)
- 2×2 ablation: judge score format (continuous vs binary) × oracle label format (binary majority vs continuous physician agreement)
- Prompt A/B confound check (confidence-augmented vs original binary-only prompt)
The key technique behind the coverage sweeps is CJE. It works by learning the systematic relationship between judge scores and physician labels on a small labeled sample (the “oracle”), then using that mapping to correct the judge's scores on all remaining records. “Oracle coverage” is the fraction of records with physician labels available for this correction—at 5% coverage, only ~1,400 of the 29,511 records need physician labels, and CJE corrects the rest from that small sample.
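A minimal sketch of that correction, using plain scikit-learn isotonic regression on simulated data. Everything here is illustrative (the variable names, the simulated score distributions); it is not the CJE package's API, only the shape of the idea: fit the judge-to-physician mapping on the 5% oracle slice, apply it everywhere.

```python
# Sketch of a CJE-style correction, assuming only scikit-learn.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)
n = 29_511

# Simulated records: physicians mark ~67% of criteria as met, and the
# judge's confidence scores systematically overshoot those labels.
truth = (rng.random(n) < 0.671).astype(float)
judge = np.clip(0.55 + 0.40 * truth + 0.15 * rng.random(n), 0.0, 1.0)

# "Oracle coverage" of 5%: physician labels exist for ~1,400 records.
oracle = rng.choice(n, size=int(0.05 * n), replace=False)

# Learn the judge -> physician mapping on the labeled 5% only ...
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge[oracle], truth[oracle])

# ... then correct every record with that mapping.
calibrated = iso.predict(judge)

print(f"raw judge mean:  {judge.mean():.3f}")   # well above the truth mean
print(f"calibrated mean: {calibrated.mean():.3f}")
print(f"physician mean:  {truth.mean():.3f}")
```

In this toy setup the raw judge mean overshoots by roughly 20 pp, while the mean calibrated on the 5% slice lands within a fraction of a point of the physician mean.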
Finding 1: Cross-judge category disagreement is massive
Same 29,511 records, same physician ground truth, two different judges. The “Ground Truth” column shows the physician majority-vote satisfaction rate for each category. Here are the categories with the largest cross-judge divergence:
| Category | N | Ground Truth | gpt-4o-mini | Haiku 4.5 | Cross-Judge Gap |
|---|---|---|---|---|---|
| hedging / no-uncertainty / seeks context | 910 | 86.5% | 0.4% | 73.6% | 73.2 pp |
| hedging / irreducible / seeks context | 792 | 84.8% | 0.9% | 65.8% | 64.9 pp |
| emergency / non-emergent / context seeking | 536 | 67.0% | 98.1% | 49.4% | 48.7 pp |
| complex / detailed / accuracy | 644 | 61.3% | 8.7% | 40.9% | 32.2 pp |
| complex / simple / accuracy | 652 | 77.3% | 33.7% | 61.8% | 28.1 pp |
| complex / detailed / appropriate | 644 | 64.1% | 81.8% | 55.8% | 26.0 pp |
On hedging/seeks-context criteria, gpt-4o-mini is nearly reversed relative to physicians (0.4% vs 86.5%) while Haiku is merely low (73.6%). On emergency referrals / non-emergent / context seeking, the judges disagree in opposite directions: gpt-4o-mini overestimates by +31 pp, Haiku underestimates by −18 pp.
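A category table like the one above falls out of per-record judge outputs with a single groupby. The toy frame and column names below are illustrative, not our actual schema:

```python
# Toy per-record frame: one row per criterion-level record, with the
# physician-majority label and each judge's binary verdict.
import pandas as pd

records = pd.DataFrame({
    "category":   ["hedging/seeks", "hedging/seeks", "emergency/ctx", "emergency/ctx"],
    "physician":  [1, 1, 1, 0],
    "gpt4o_mini": [0, 0, 1, 1],
    "haiku":      [1, 0, 1, 0],
})

# Per-category satisfaction rates for physicians and both judges,
# plus the absolute cross-judge gap in percentage points.
by_cat = records.groupby("category")[["physician", "gpt4o_mini", "haiku"]].mean()
by_cat["cross_judge_gap_pp"] = (by_cat["gpt4o_mini"] - by_cat["haiku"]).abs() * 100
print(by_cat)
```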
Qualifying the aggregate-level documentation
The inspect_evals HealthBench README states: “Three judge models were tried on the meta eval (GPT-5-nano, GPT-4-mini, Claude Haiku 4.5) and no statistically significant difference was observed, so we kept gpt-4o-mini as the judge model despite it being outdated.” Our results are compatible with that aggregate statement, but they show that aggregate judge equivalence can still hide severe category-level divergence.
What HealthBench already validates, and what this adds
HealthBench already includes a trustworthiness check: it compares model-based grading to physician grading on consensus criteria and reports agreement metrics (including Macro F1) against physician baselines. That is an important validation step and we do not dispute it.
Our contribution is different. We analyze calibration and category-level behavior under two concrete judges used in practice, then quantify how much oracle labeling is needed to correct aggregate estimates. Agreement can look acceptable in aggregate while category-level errors remain large enough to change conclusions about model behavior.
In short: HealthBench asks whether judge labels broadly track physicians; this audit asks whether reported benchmark scores and category conclusions are numerically reliable for decision-making.
Finding 2: Both judges are overconfident, but by different amounts
| Metric | gpt-4o-mini | Haiku 4.5 |
|---|---|---|
| Mean confidence | 0.916 | 0.801 |
| Physician agreement rate | 67.1% | 67.1% |
| Calibration shift | −24.5 pp | −13.0 pp |
| Binary accuracy | 68.3% | 71.9% |
| Binary F1 | 0.773 | 0.792 |
Both judges are systematically overconfident. gpt-4o-mini's confidence clusters near 1.0 (mean 0.916); Haiku uses more of the range (mean 0.801, min 0.000, max 0.990). But both overshoot the physician ground truth substantially.
If you only look at binary accuracy and F1, the judges look comparable (~68–72% accuracy, ~0.77–0.79 F1). The calibration shift, visible only through the continuous confidence scores, reveals how different the underlying behavior is.
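A small simulation makes this concrete: a judge that matches the physician label about two-thirds of the time posts acceptable accuracy and F1, and only the continuous confidence column exposes the roughly −25 pp shift. The numbers below are simulated to mimic the gpt-4o-mini row, not recomputed from our data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
n = 10_000

truth = (rng.random(n) < 0.671).astype(int)  # physician majority labels
agree = rng.random(n) < 0.671                # judge matches physicians 67.1% of the time
pred = np.where(agree, truth, 1 - truth)     # judge's binary verdict
conf = 0.85 + 0.13 * rng.random(n)           # self-reported confidence, mean ~0.915

acc = accuracy_score(truth, pred)
f1 = f1_score(truth, pred)
shift_pp = (acc - conf.mean()) * 100         # agreement rate minus mean confidence

print(f"accuracy {acc:.3f}, F1 {f1:.3f}, calibration shift {shift_pp:+.1f} pp")
```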
Finding 3: Calibration converges. The ground truth is recoverable
After CJE calibration[1], both judges converge to the same estimate (~0.671) despite very different raw scores:
| Oracle Coverage | gpt-4o-mini Estimate | gpt-4o-mini 95% CI | Haiku 4.5 Estimate | Haiku 4.5 95% CI |
|---|---|---|---|---|
| 100% | 0.6711 | [0.666, 0.678] | 0.6712 | [0.666, 0.677] |
| 10% | 0.6828 | [0.667, 0.700] | 0.6770 | [0.661, 0.695] |
| 5% | 0.6918 | [0.669, 0.716] | 0.6852 | [0.662, 0.712] |
At 5% oracle coverage (~1,400 physician labels), the calibrated estimate remains in the same range as the full-data answer (within about 0.6–2.1 pp in this deterministic sample). The larger raw miscalibration of gpt-4o-mini does not prevent recovery.
The practical result
You don't need to relabel everything. A small oracle sample is enough to detect and correct large calibration errors. For a benchmark that already ships 29,511 physician labels, this is trivial. The labels exist, they're just not being used for calibration.
Finding 4: Oracle definition matters more than judge format
Judge format (continuous confidence vs binary met/not-met) makes no difference after calibration. But how you define physician ground truth shifts the calibrated estimate by 10.1 pp—from 0.671 (binary majority) to 0.772 (continuous agreement fraction). This is not a calibration error; it's a measurement decision about whether to preserve physician disagreement information.
Show full 2×2 ablation results
We ran a 2×2 ablation on both judges: judge score format (continuous confidence vs binary met/not-met) × oracle label format (binary physician majority vs continuous physician agreement rate).
| Condition | Judge Format | Oracle Format | gpt-4o-mini | Haiku |
|---|---|---|---|---|
| A | continuous | binary majority | 0.6711 | 0.6711 |
| B | binary | binary majority | 0.6711 | 0.6711 |
| C | continuous | continuous agreement | 0.7717 | 0.7718 |
| D | binary | continuous agreement | 0.7717 | 0.7718 |
Judge format doesn't matter. Continuous vs binary judge scores produce identical calibrated estimates (A = B, C = D). Even though Haiku uses more of the confidence range, it doesn't help.
Oracle definition shifts the estimand by +10.1 pp. Switching from binary physician majority to continuous agreement fraction changes what you're estimating: a score of 3/5 physicians agreeing counts as 0.6 instead of 1.0. Benchmark users should be explicit about which target they're using.
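The two oracle definitions differ only in how a split physician panel is scored. A minimal illustration:

```python
# Five physicians vote on whether one criterion was met.
votes = [1, 1, 1, 0, 0]

# Binary majority oracle: 3 of 5 say "met", so the record scores 1.0.
binary_majority = float(sum(votes) > len(votes) / 2)

# Continuous agreement oracle: the record scores the agreement fraction, 0.6.
continuous_agreement = sum(votes) / len(votes)

print(binary_majority, continuous_agreement)  # 1.0 0.6
```

Averaged over many split panels, the binary-majority oracle rounds every disagreement toward the majority, which is why it produces a lower target (0.671) than the continuous oracle (0.772) on this data.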
Robustness: Prompt A/B confound
We modified the judge prompt to request confidence scores (which the original HealthBench prompt does not). This changes ~5.1% of individual binary decisions, but aggregate quality metrics are essentially unchanged (accuracy delta: −0.0004, F1 delta: −0.0041). The aggregate overconfidence signal is not an artifact of our prompt modification, though category-level sensitivity still exists.
What this means for benchmark users
Category-level scores are not trustworthy by default
On hedging and context-seeking criteria, the default judge's scores are functionally random or reversed relative to physicians. Without category-specific validation or calibration, a model that genuinely excels at appropriate hedging can be penalized by gpt-4o-mini and, to a lesser extent, by Haiku.
Aggregate scores hide the problem
Both judges achieve ~68–72% binary accuracy overall because category-level overestimation and underestimation cancel out. The aggregate number looks acceptable while individual subscores are severely wrong.
Aggregate judge equivalence can hide instability
Two judges can look similar on overall metrics while disagreeing by up to 73 pp on specific criterion families. Aggregate agreement does not guarantee judge-invariant category conclusions.
Aggregate calibration is cheap
5% oracle coverage (~1,400 labels) recovers the aggregate physician-aligned mean. For a benchmark that already ships 29,511 physician labels, this is operationally cheap. That does not repair broken category-level subscores.
How you define ground truth matters
Binary physician majority vs continuous agreement rate shifts the calibrated estimate by 10.1 pp. Benchmark users should be explicit about which oracle they're targeting.
Limitations
- We audited two judges. A broader sweep (GPT-4.1, GPT-5, Sonnet 4.6) would strengthen the cross-judge findings.
- Category-level CJE calibration (per-category sweeps) is not yet implemented. We calibrate at the aggregate level only.
- The physician labels themselves have imperfect inter-rater agreement. The “ground truth” is noisy.
- We used the meta-eval subset (29,511 records). The full HealthBench has ~48,000 criteria.
- CJE calibration corrects aggregate estimates but does not fix the underlying category-level judge failures. A calibrated aggregate score still averages over broken subscores.
Method
- Data: HealthBench meta-eval[2] (inspect_evals implementation), N=29,511 criterion-level records with physician consensus labels.
- Judges: gpt-4o-mini (inspect_evals default) and claude-haiku-4-5-20251001, temperature 0.0. Both used a confidence-augmented prompt; a prompt A/B check confirmed this modification flips ~5.1% of decisions but does not materially change aggregate metrics.
- Parse success: gpt-4o-mini 29,510/29,511 (99.997%); Haiku 29,501/29,511 (99.97%). Failed records were excluded.
- Calibration: CJE mean-preserving isotonic calibration with auto inference mode (which selected clustered bootstrap inference on this dataset), n_bootstrap=500, and deterministic subsampling (seed base 42, one seed) for reduced-oracle analyses[1].
- Code: Analysis scripts and pre-cached judge outputs are in the reproducibility repo. A general-purpose judge calibration tool is in a merged PR to inspect_evals.
Reproducing this analysis
The analysis scripts, pre-cached judge outputs, and pre-computed results are in the reproducibility repo. You can verify the key numbers without API keys by running CJE calibration on the cached data. If your package index does not yet have the current CJE release, install it directly from source following the instructions in the reproducibility repo.
The exact commands and flags used for the numbers reported here are documented in the reproducibility repo. Given any set of judge scores and a small sample of oracle (human) labels, CJE produces calibrated estimates with valid confidence intervals. See the paper or the documentation for details.
