Your HealthBench Scores Depend on Which Judge You Use
An audit of 29,511 physician-labeled records reveals massive cross-judge disagreement at the category level
TL;DR
- Two LLM judges (gpt-4o-mini, Claude Haiku 4.5) disagree by up to 73 percentage points on specific HealthBench criteria categories, and sometimes in opposite directions.
- Both judges are overconfident relative to physician ground truth: gpt-4o-mini by 24.5 pp, Haiku by 13.0 pp. Standard metrics (F1, accuracy) can still look acceptable while calibration and category-level failures remain.
- After aggregate calibration, both converge to the same physician-aligned mean (~0.671). With physician labels on just 5% of records (~1,400 labels), that aggregate estimate is recoverable, but broken category-level subscores remain broken.
Why this matters
HealthBench (OpenAI, May 2025) is one of the strongest open benchmarks for medical AI conversation evaluation. It addresses limitations of saturated multiple-choice benchmarks (MedQA, USMLE) by using physician-authored rubrics and criterion-level scoring. It's been used by OpenAI for model launches, by startups like DR. INFO and Intelligent Internet for competitive claims, and by the UK AI Safety Institute's Inspect framework for standardized evaluation.
Most public HealthBench score reports we found use raw judge outputs, and many do not foreground the judge model next to the headline score. We did not find downstream score reports that calibrate benchmark means back to physician labels.
HealthBench also ships something rare: a meta-evaluation set with physician consensus labels for individual criteria. We evaluated all 29,511 criterion-level records in the inspect_evals implementation. This lets us directly audit the judge, and we haven't found a prior systematic cross-judge calibration audit of the released meta-eval data.
What we ran
- Full judge audit on all 29,511 meta-eval records with two judges: gpt-4o-mini (the inspect_evals default) and claude-haiku-4-5-20251001
- We modified the judge prompt to also request a 0–1 confidence score alongside the binary judgment (temperature 0.0). A prompt A/B check confirmed this does not materially change aggregate results (see the Robustness section below).
- Oracle-coverage sweeps (100/50/25/10/5%) using CJE[1] (Causal Judge Evaluation), with deterministic subsampling (seed base 42, one seed)
- 2×2 ablation: judge score format (continuous vs binary) × oracle label format (binary majority vs continuous physician agreement)
- Prompt A/B confound check (confidence-augmented vs original binary-only prompt)
The key technique behind the coverage sweeps is CJE. It works by learning the systematic relationship between judge scores and physician labels on a small labeled sample (the “oracle”), then using that mapping to correct the judge's scores on all remaining records. “Oracle coverage” is the fraction of records with physician labels available for this correction—at 5% coverage, only ~1,400 of the 29,511 records need physician labels, and CJE corrects the rest from that small sample.
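A minimal sketch of that correction, using plain scikit-learn isotonic regression on simulated data. Everything here is illustrative (the variable names, the simulated score distributions); it is not the CJE package's API, only the shape of the idea: fit the judge-to-physician mapping on the 5% oracle slice, apply it everywhere.

```python
# Sketch of a CJE-style correction, assuming only scikit-learn.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)
n = 29_511

# Simulated records: physicians mark ~67% of criteria as met, and the
# judge's confidence scores systematically overshoot those labels.
truth = (rng.random(n) < 0.671).astype(float)
judge = np.clip(0.55 + 0.40 * truth + 0.15 * rng.random(n), 0.0, 1.0)

# "Oracle coverage" of 5%: physician labels exist for ~1,400 records.
oracle = rng.choice(n, size=int(0.05 * n), replace=False)

# Learn the judge -> physician mapping on the labeled 5% only ...
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge[oracle], truth[oracle])

# ... then correct every record with that mapping.
calibrated = iso.predict(judge)

print(f"raw judge mean:  {judge.mean():.3f}")   # well above the truth mean
print(f"calibrated mean: {calibrated.mean():.3f}")
print(f"physician mean:  {truth.mean():.3f}")
```

In this toy setup the raw judge mean overshoots by roughly 20 pp, while the mean calibrated on the 5% slice lands within a fraction of a point of the physician mean.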
Finding 1: Cross-judge category disagreement is massive
Same 29,511 records, same physician ground truth, two different judges. The “Ground Truth” column shows the physician majority-vote satisfaction rate for each category. Here are the categories with the largest cross-judge divergence:
| Category | N | Ground Truth | gpt-4o-mini | Haiku 4.5 | Cross-Judge Gap |
|---|---|---|---|---|---|
| hedging / no-uncertainty / seeks context | 910 | 86.5% | 0.4% | 73.6% | 73.2 pp |
| hedging / irreducible / seeks context | 792 | 84.8% | 0.9% | 65.8% | 64.9 pp |
| emergency / non-emergent / context seeking | 536 | 67.0% | 98.1% | 49.4% | 48.7 pp |
| complex / detailed / accuracy | 644 | 61.3% | 8.7% | 40.9% | 32.2 pp |
| complex / simple / accuracy | 652 | 77.3% | 33.7% | 61.8% | 28.1 pp |
| complex / detailed / appropriate | 644 | 64.1% | 81.8% | 55.8% | 26.0 pp |
On hedging/seeks-context criteria, gpt-4o-mini is nearly reversed relative to physicians (0.4% vs 86.5%) while Haiku is merely low (73.6%). On emergency referrals / non-emergent / context seeking, the judges disagree in opposite directions: gpt-4o-mini overestimates by +31 pp, Haiku underestimates by −18 pp.
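A category table like the one above falls out of per-record judge outputs with a single groupby. The toy frame and column names below are illustrative, not our actual schema:

```python
# Toy per-record frame: one row per criterion-level record, with the
# physician-majority label and each judge's binary verdict.
import pandas as pd

records = pd.DataFrame({
    "category":   ["hedging/seeks", "hedging/seeks", "emergency/ctx", "emergency/ctx"],
    "physician":  [1, 1, 1, 0],
    "gpt4o_mini": [0, 0, 1, 1],
    "haiku":      [1, 0, 1, 0],
})

# Per-category satisfaction rates for physicians and both judges,
# plus the absolute cross-judge gap in percentage points.
by_cat = records.groupby("category")[["physician", "gpt4o_mini", "haiku"]].mean()
by_cat["cross_judge_gap_pp"] = (by_cat["gpt4o_mini"] - by_cat["haiku"]).abs() * 100
print(by_cat)
```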
Qualifying the aggregate-level documentation
The inspect_evals HealthBench README states: “Three judge models were tried on the meta eval (GPT-5-nano, GPT-4-mini, Claude Haiku 4.5) and no statistically significant difference was observed, so we kept gpt-4o-mini as the judge model despite it being outdated.” Our results are compatible with that aggregate statement, but they show that aggregate judge equivalence can still hide severe category-level divergence.
What HealthBench already validates, and what this adds
HealthBench already includes a trustworthiness check: it compares model-based grading to physician grading on consensus criteria and reports agreement metrics (including Macro F1) against physician baselines. That is an important validation step and we do not dispute it.
Our contribution is different. We analyze calibration and category-level behavior under two concrete judges used in practice, then quantify how much oracle labeling is needed to correct aggregate estimates. Agreement can look acceptable in aggregate while category-level errors remain large enough to change conclusions about model behavior.
In short: HealthBench asks whether judge labels broadly track physicians; this audit asks whether reported benchmark scores and category conclusions are numerically reliable for decision-making.
Finding 2: Both judges are overconfident, but by different amounts
| Metric | gpt-4o-mini | Haiku 4.5 |
|---|---|---|
| Mean confidence | 0.916 | 0.801 |
| Physician agreement rate | 67.1% | 67.1% |
| Calibration shift | −24.5 pp | −13.0 pp |
| Binary accuracy | 68.3% | 71.9% |
| Binary F1 | 0.773 | 0.792 |
Both judges are systematically overconfident. gpt-4o-mini's confidence clusters near 1.0 (mean 0.916); Haiku uses more of the range (mean 0.801, min 0.000, max 0.990). But both overshoot the physician ground truth substantially.
If you only look at binary accuracy and F1, the judges look comparable (~68–72% accuracy, ~0.77–0.79 F1). The calibration shift, visible only through the continuous confidence scores, reveals how different the underlying behavior is.
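A small simulation makes this concrete: a judge that matches the physician label about two-thirds of the time posts acceptable accuracy and F1, and only the continuous confidence column exposes the roughly −25 pp shift. The numbers below are simulated to mimic the gpt-4o-mini row, not recomputed from our data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
n = 10_000

truth = (rng.random(n) < 0.671).astype(int)  # physician majority labels
agree = rng.random(n) < 0.671                # judge matches physicians 67.1% of the time
pred = np.where(agree, truth, 1 - truth)     # judge's binary verdict
conf = 0.85 + 0.13 * rng.random(n)           # self-reported confidence, mean ~0.915

acc = accuracy_score(truth, pred)
f1 = f1_score(truth, pred)
shift_pp = (acc - conf.mean()) * 100         # agreement rate minus mean confidence

print(f"accuracy {acc:.3f}, F1 {f1:.3f}, calibration shift {shift_pp:+.1f} pp")
```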
Finding 3: Calibration converges. The ground truth is recoverable
After CJE calibration[1], both judges converge to the same estimate (~0.671) despite very different raw scores:
| Oracle Coverage | gpt-4o-mini Estimate | gpt-4o-mini 95% CI | Haiku 4.5 Estimate | Haiku 4.5 95% CI |
|---|---|---|---|---|
| 100% | 0.6711 | [0.666, 0.678] | 0.6712 | [0.666, 0.677] |
| 10% | 0.6828 | [0.667, 0.700] | 0.6770 | [0.661, 0.695] |
| 5% | 0.6918 | [0.669, 0.716] | 0.6852 | [0.662, 0.712] |
At 5% oracle coverage (~1,400 physician labels), the calibrated estimate remains in the same range as the full-data answer (within about 0.6–2.1 pp in this deterministic sample). The larger raw miscalibration of gpt-4o-mini does not prevent recovery.
The practical result
You don't need to relabel everything. A small oracle sample is enough to detect and correct large calibration errors. For a benchmark that already ships 29,511 physician labels, this is trivial. The labels exist, they're just not being used for calibration.
Finding 4: Oracle definition matters more than judge format
Judge format (continuous confidence vs binary met/not-met) makes no difference after calibration. But how you define physician ground truth shifts the calibrated estimate by 10.1 pp—from 0.671 (binary majority) to 0.772 (continuous agreement fraction). This is not a calibration error; it's a measurement decision about whether to preserve physician disagreement information.
Show full 2×2 ablation results
We ran a 2×2 ablation on both judges: judge score format (continuous confidence vs binary met/not-met) × oracle label format (binary physician majority vs continuous physician agreement rate).
| Condition | Judge Format | Oracle Format | gpt-4o-mini | Haiku |
|---|---|---|---|---|
| A | continuous | binary majority | 0.6711 | 0.6711 |
| B | binary | binary majority | 0.6711 | 0.6711 |
| C | continuous | continuous agreement | 0.7717 | 0.7718 |
| D | binary | continuous agreement | 0.7717 | 0.7718 |
Judge format doesn't matter. Continuous vs binary judge scores produce identical calibrated estimates (A = B, C = D). Even though Haiku uses more of the confidence range, it doesn't help.
Oracle definition shifts the estimand by +10.1 pp. Switching from binary physician majority to continuous agreement fraction changes what you're estimating: a score of 3/5 physicians agreeing counts as 0.6 instead of 1.0. Benchmark users should be explicit about which target they're using.
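The two oracle definitions differ only in how a split physician panel is scored. A minimal illustration:

```python
# Five physicians vote on whether one criterion was met.
votes = [1, 1, 1, 0, 0]

# Binary majority oracle: 3 of 5 say "met", so the record scores 1.0.
binary_majority = float(sum(votes) > len(votes) / 2)

# Continuous agreement oracle: the record scores the agreement fraction, 0.6.
continuous_agreement = sum(votes) / len(votes)

print(binary_majority, continuous_agreement)  # 1.0 0.6
```

Averaged over many split panels, the binary-majority oracle rounds every disagreement toward the majority, which is why it produces a lower target (0.671) than the continuous oracle (0.772) on this data.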
Robustness: Prompt A/B confound
We modified the judge prompt to request confidence scores (which the original HealthBench prompt does not). This changes ~5.1% of individual binary decisions, but aggregate quality metrics are essentially unchanged (accuracy delta: −0.0004, F1 delta: −0.0041). The aggregate overconfidence signal is not an artifact of our prompt modification, though category-level sensitivity still exists.
What this means for benchmark users
Category-level scores are not trustworthy by default
On hedging and context-seeking criteria, the default judge's scores are functionally random or reversed relative to physicians. Without category-specific validation or calibration, a model that genuinely excels at appropriate hedging can be penalized by gpt-4o-mini and, to a lesser extent, by Haiku.
Aggregate scores hide the problem
Both judges achieve ~68–72% binary accuracy overall because category-level overestimation and underestimation cancel out. The aggregate number looks acceptable while individual subscores are severely wrong.
Aggregate judge equivalence can hide instability
Two judges can look similar on overall metrics while disagreeing by up to 73 pp on specific criterion families. Aggregate agreement does not guarantee judge-invariant category conclusions.
Aggregate calibration is cheap
5% oracle coverage (~1,400 labels) recovers the aggregate physician-aligned mean. For a benchmark that already ships 29,511 physician labels, this is operationally cheap. That does not repair broken category-level subscores.
How you define ground truth matters
Binary physician majority vs continuous agreement rate shifts the calibrated estimate by 10.1 pp. Benchmark users should be explicit about which oracle they're targeting.
Limitations
- We audited two judges. A broader sweep (GPT-4.1, GPT-5, Sonnet 4.6) would strengthen the cross-judge findings.
- Category-level CJE calibration (per-category sweeps) is not yet implemented. We calibrate at the aggregate level only.
- The physician labels themselves have imperfect inter-rater agreement. The “ground truth” is noisy.
- We used the meta-eval subset (29,511 records). The full HealthBench has ~48,000 criteria.
- CJE calibration corrects aggregate estimates but does not fix the underlying category-level judge failures. A calibrated aggregate score still averages over broken subscores.
Method
- Data: HealthBench meta-eval[2] (inspect_evals implementation), N=29,511 criterion-level records with physician consensus labels.
- Judges: gpt-4o-mini (inspect_evals default) and claude-haiku-4-5-20251001, temperature 0.0. Both used a confidence-augmented prompt; a prompt A/B check confirmed this modification flips ~5.1% of decisions but does not materially change aggregate metrics.
- Parse success: gpt-4o-mini 29,510/29,511 (99.997%); Haiku 29,501/29,511 (99.97%). Failed records were excluded.
- Calibration: CJE mean-preserving isotonic calibration with auto inference mode (which selected clustered bootstrap inference on this dataset), n_bootstrap=500, and deterministic subsampling (seed base 42, one seed) for reduced-oracle analyses[1].
- Code: Analysis scripts and pre-cached judge outputs are in the reproducibility repo. A general-purpose judge calibration tool is in a merged PR to inspect_evals.
Reproducing this analysis
The analysis scripts, pre-cached judge outputs, and pre-computed results are in the reproducibility repo. You can verify the key numbers without API keys by running CJE calibration on the cached data. If your package index does not yet have the current CJE release, install it directly from source following the instructions in the reproducibility repo.
The exact commands and flags used for the numbers reported here are documented in the reproducibility repo. Given any set of judge scores and a small sample of oracle (human) labels, CJE produces calibrated estimates with valid confidence intervals. See the paper or the documentation for details.
