What Are We Really Measuring?
AI quality as surrogacy toward an idealized deliberation oracle
Thesis
Most AI metrics are proxies; treat them as calibrated surrogates for an Idealized Deliberation Oracle (think: the score you'd give after unlimited time and tools) and report policy quality on the oracle scale, with uncertainty that includes the cost of learning that calibration. This post lays out a statistical framework to do this rigorously, which we call Causal Judge Evaluation (CJE).
🎯 The Problem
Your team just shipped a new model. The judge scores were state-of-the-art, but real-world user satisfaction is tanking. What went wrong?
You were measuring the wrong thing. Your judge score was a bad surrogate for what users actually care about.
This post lays out a simple statistical framework to fix this. We'll show you how to define the right target, use your existing metrics as calibrated surrogates, and measure your uncertainty honestly.
📐 For theorists
This post focuses on intuition and practical framing. For rigorous definitions, identification theorems, influence functions, and asymptotic theory, see the Technical Appendix.
The target you wish you could measure
If you (or your users, or your regulators) could spend as long as needed to assess an AI system's behavior for a given task, what would the score be?
Call that random variable Y* (Y-star): an Idealized Deliberation Oracle (IDO). For a policy π (model + prompt + tools + settings), the evaluation target is the policy value:
V(π) = E[Y*(X, A_π(X))]
In plain English: a policy's quality is the average IDO score you'd give to its answers on your real prompts.
Quick cheat sheet
S = fast score (cheap) • Y = slower but feasible • Y* = what you'd endorse after deep review
X = context/prompt • A = action/response • π = policy (model + settings)
Three common IDO archetypes
Assistant IDO
Would I endorse this answer after unlimited time and tools? (Personal assistance, code generation, research)
Product IDO
Does this policy increase long-run user value (e.g., 90-day satisfaction/retention) while meeting safety constraints? (Recommendations, search, content moderation)
Business IDO
What's the discounted lifetime value of all current and future customers affected by this policy? Operational metrics (clicks, engagement) are surrogates S; near-term conversions are Y; total discounted cash flows are the ultimate business outcome Y*.
In practice, Y* is expensive or slow (deep human evaluation, task completion with long follow-up, safety audit, long-run satisfaction). So we use cheap surrogates (S): judge scores, clicks, watch-time, short surveys, automated rubric checks that arrive immediately. This creates a ubiquitous data structure: abundant cheap signals but sparse expensive outcomes. This approach builds on decades of statistical work on surrogate endpoints.
⚠️ Scope statement
This framework solves the measurement problem given your values—it doesn't tell you what to value. If your application's IDO is "patient health after informed consent," you can't be satisfied with click-through unless you can calibrate it to that outcome. The choice of what Y* represents is normative; surrogacy is the statistical machinery to bridge cheap signals to that target. If process has intrinsic moral value (e.g., dignity constraints, procedural fairness), encode it into Y* rather than treating it as irrelevant.
Surrogacy: turning fast signals into the thing you care about
A surrogate is useful when it is predictively sufficient for the target, conditional on observed context (X):
E[Y* | X, A, S] = μ(S, X)
If that holds, S is "sufficient" once you condition on context: the action A adds no extra predictive power beyond S and X on the oracle scale.
In practice, we calibrate S against the highest-rung measurement Y we can afford (e.g., human audits, A/B test KPIs, or a GPT-5 judge), which itself serves as our best available proxy for the true Idealized Deliberation Oracle Y*. The framework treats Y as the observable "ground truth" for learning f, while Y* remains the ultimate target we report on.
Under this assumption, you can learn a calibration function f(S,X) ≈ μ(S,X) on a small labeled slice with ground-truth Y, then apply it broadly to compute calibrated rewards R = f(S,X). Estimate V(π) by averaging R for responses produced by π, and quantify uncertainty that includes both sampling error and the fact that f was learned on a finite oracle slice (formal conditions and inference theory in the Technical Appendix).
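A minimal sketch of this loop in Python (numpy + scikit-learn), assuming a toy setup where the surrogate is a single judge score and the context covariates X are dropped for brevity; the data and variable names (S_labeled, Y_labeled, S_pi) are illustrative, not part of any CJE library.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy oracle slice: items with both the cheap surrogate S (e.g., a judge score)
# and the expensive label Y (e.g., a human audit score). Purely illustrative data.
rng = np.random.default_rng(0)
S_labeled = rng.uniform(0, 1, 500)                              # surrogate scores
Y_labeled = np.clip(S_labeled + rng.normal(0, 0.1, 500), 0, 1)  # oracle-slice labels

# Learn a monotone calibration f: S -> Y on the labeled slice.
f = IsotonicRegression(out_of_bounds="clip")
f.fit(S_labeled, Y_labeled)

# Evaluation log for policy pi: surrogate scores only, no oracle labels.
S_pi = rng.uniform(0, 1, 10_000)

# Calibrated rewards on the oracle scale and the direct policy-value estimate.
R_pi = f.predict(S_pi)
V_hat = R_pi.mean()
se_sampling = R_pi.std(ddof=1) / np.sqrt(len(R_pi))  # sampling term only; the OUA term comes later
print(f"V(pi) ~ {V_hat:.3f} +/- {1.96 * se_sampling:.3f} (sampling-only CI)")
```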
Notation at a glance
Y*: Idealized Oracle (the target we care about) • Y: Highest practical rung (human audit, A/B KPI, GPT-5) • S: Cheap surrogate (judge score, click, watch-time) • f(S,X): Calibration function • R = f(S,X) on Y-scale • V(π) = E[R|π] (policy value estimate)
The Deliberation Ladder
Instantaneous proxy [S]
Passive telemetry: clicks, dwell, autoplay watch-time; zero rater effort
Quick rating [S]
~30s human tap or LLM judge on a short rubric; single pass, no evidence
Guided review [S↗]
5–10 min rubric-driven review with retrieval/citations; checks for claims & references
Adjudicated audit [S↗]
Multi-rater with tie-break/adjudication, evidence verification, escalation to SME
Causal field outcome (A/B) [Y]
Prospective impact on real tasks/metrics (e.g., 14–90d satisfaction, task success)
IDO (Idealized Deliberation Oracle)
Unlimited time/tools; normative aggregation of constraints & stakeholder weights
CJE's job: for each rung k, fit a calibration f_k: S^(k) → Y* (monotone in the surrogate score, not across k). A practical recipe is two-stage calibration: first build a bias-corrected index T = g(S, X) (e.g., remove verbosity/length/style effects), then fit isotonic T → Y* on a modest higher-rung slice. Treat k=4 as a causal Y label: it's closer to Y*, but still a surrogate unless your IDO equals that field metric.
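A hedged sketch of that two-stage recipe, assuming response length is the nuisance to remove and using a simple linear residualization for g; the function fit_two_stage and the toy data are illustrative, and a production version would add cross-fitting and further covariates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.isotonic import IsotonicRegression

def fit_two_stage(S, length, Y):
    """Two-stage calibration sketch: (1) build a bias-corrected index T = g(S, X)
    by removing the length effect from the judge score, (2) fit a monotone
    (isotonic) map T -> Y on the oracle slice."""
    # Stage 1: regress the judge score on response length; the residual
    # (re-centered at the mean score) is an index with the length effect removed.
    g = LinearRegression().fit(length.reshape(-1, 1), S)
    T = S - g.predict(length.reshape(-1, 1)) + S.mean()

    # Stage 2: monotone map from the corrected index to the oracle label.
    iso = IsotonicRegression(out_of_bounds="clip").fit(T, Y)

    def f(S_new, length_new):
        T_new = S_new - g.predict(length_new.reshape(-1, 1)) + S.mean()
        return iso.predict(T_new)
    return f

# Toy slice where the judge score is inflated by length but the oracle label is not.
rng = np.random.default_rng(1)
length = rng.uniform(50, 2000, 800)
Y = rng.uniform(0, 1, 800)
S = np.clip(Y + 0.0002 * length + rng.normal(0, 0.05, 800), 0, 2)

f = fit_two_stage(S, length, Y)
R = f(S, length)  # calibrated rewards on the Y scale
```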
This pattern in practice
Sophisticated teams already operate this way, even without formal IDO terminology:
- Netflix: Learns proxy metrics from past experiments to predict long-term member satisfaction (Netflix Tech Blog)
- YouTube: Shifted from clicks → watch time → survey-calibrated "valued watch time" (YouTube Blog)
- Amazon: Measured 97% directional agreement between offline ranking metrics and online business metrics (Amazon Science)
CJE formalizes this intuition with calibration functions, transportability diagnostics, and honest uncertainty.
Transportability: when does the mapping travel?
Even if f fits on last month's data, it can fail when you change contexts. When does the calibration function f(S,X) transport across:
- Policy: New model/prompt changes style or length effects
- Population: New user cohorts, markets, languages
- Time: Preference drift, product changes
Concrete drift example
We trained on English Q&A with GPT-4.1-nano as judge; now we're evaluating a longer-form coding assistant with a new judge (4.5-nano). Expect length/style to shift the S→Y* mapping; run residual-by-policy tests and, if biased, recalibrate on a small target slice.
So treat "calibration learned on slice A, applied to slice B" as a claim to test. Diagnostics (the formal causal theory is in the Technical Appendix §3.5):
🔍 Transport Diagnostics
- ✓ Residual by stratum: Is E[Y* − f(S,X) | policy or cohort] = 0?
→ If residuals ≠ 0: recalibrate on a small target slice or add covariates to f.
- ✓ Coverage: Do 95% CIs hit the truth ~95% of the time on held-out oracle labels?
→ If coverage is off: collect more oracle labels in under-covered strata.
- ✓ Stratified calibration plots: Across domains/lengths/time
→ If stratified drift appears: test S1/S2 per group; fall back to a Kallus & Mao-style surrogate approach or collect target labels.
✓ Green light rule
Don't promote a model based on surrogates unless residual-by-stratum means are ~0 on a fresh oracle slice. Plot E[Y* - f(S,X) | policy/cohort] with confidence bands; if systematic bias appears (bands don't overlap zero), recalibrate before shipping.
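A minimal sketch of that green-light check, assuming you have a fresh oracle slice with true labels Y_star, calibrated rewards R, and a policy or cohort tag per row; the function name and the normal-approximation band are illustrative choices.

```python
import numpy as np

def residual_by_group(Y_star, R, groups, z=1.96):
    """Mean oracle residual Y* - f(S,X) per group (policy or cohort), with a
    normal-approximation confidence band. Bands that exclude zero flag bias."""
    report = {}
    for g in np.unique(groups):
        resid = Y_star[groups == g] - R[groups == g]
        mean = resid.mean()
        half = z * resid.std(ddof=1) / np.sqrt(len(resid))
        report[g] = (mean, mean - half, mean + half)
    return report

# Hypothetical usage on a fresh oracle slice (Y_star, calibrated rewards R, policy tags):
# for g, (m, lo, hi) in residual_by_group(Y_star, R, policy_ids).items():
#     print(f"{g}: {m:+.3f} [{lo:+.3f}, {hi:+.3f}]", "OK" if lo <= 0 <= hi else "RECALIBRATE")
```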
A ladder of surrogates (and why it matters)
Often you have multiple cheap signals S1, …, SK: clicks, watch-time, short satisfaction taps, rubric scores, complaint rate, near-term retention.
Calibrate each—or build a surrogate index that best predicts a longer-run target (e.g., 90-day satisfaction/retention or human audit) by combining multiple short-run signals.
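One hedged way to build such an index: regress the long-run target on the stack of short-run signals over the labeled slice, then treat the fitted prediction as a single surrogate to calibrate. The ridge choice, the toy data, and the column layout below are illustrative assumptions, not a prescribed CJE step.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Toy labeled slice: K = 4 short-run signals per item (e.g., click, watch-time,
# quick rating, complaint rate) and a long-run target such as 90-day retention.
rng = np.random.default_rng(2)
S_matrix = rng.normal(size=(1_000, 4))
Y_long_run = S_matrix @ np.array([0.10, 0.40, 0.30, 0.05]) + rng.normal(0, 0.2, 1_000)

# Learn weights on the labeled slice; the fitted prediction is the surrogate index.
index_model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(S_matrix, Y_long_run)
surrogate_index = index_model.predict(S_matrix)  # feed this index into the S -> Y* calibration step
```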
Uncertainty from both steps
When you calibrate a surrogate to predict an oracle, your uncertainty comes from two sources:
1. Sampling variability
Noise in the evaluation log—how many prompts you scored, natural variation in responses
2. Learning variability (OUA)
Oracle-Uncertainty-Aware variance: uncertainty from fitting f on a finite oracle slice. Treating f as known systematically understates total uncertainty.
In practice, refit f on jackknifed folds of the oracle slice and add that "calibration variance" to the base sampling variance; this accounts for how errors in f propagate to errors in V(π). The Oracle-Uncertainty-Aware (OUA) principle is what makes the reported confidence intervals honest: they reflect both sampling noise and calibration error.
Note: OUA variance is zero at 100% oracle coverage because f is not learned; you're directly observing the target label (Y* or your chosen high-rung Y).
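A sketch of that jackknife, assuming an S-only isotonic calibration for brevity; a real estimator would use f(S,X) with cross-fitting, but the variance bookkeeping is the same: refit without each oracle fold, recompute V(π), and convert the spread into an additive variance term. The function name and fold count are illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def oua_jackknife_variance(S_labeled, Y_labeled, S_pi, n_folds=10, seed=0):
    """Delete-one-fold jackknife over the oracle slice: refit the calibration
    without each fold, recompute V(pi), and turn the spread of those estimates
    into a calibration-variance term to add to the sampling variance."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(S_labeled))
    V_minus_k = []
    for k in range(n_folds):
        keep = fold != k
        f_k = IsotonicRegression(out_of_bounds="clip").fit(S_labeled[keep], Y_labeled[keep])
        V_minus_k.append(f_k.predict(S_pi).mean())
    V_minus_k = np.asarray(V_minus_k)
    return (n_folds - 1) / n_folds * np.sum((V_minus_k - V_minus_k.mean()) ** 2)

# Hypothetical usage, combining with the sampling variance from the evaluation log:
# total_var = sampling_var + oua_jackknife_variance(S_labeled, Y_labeled, S_pi)
# ci_half_width = 1.96 * np.sqrt(total_var)
```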
Quality as a policy value: what we actually optimize
Given the oracle, V(π) = E[Y*(X, A_π(X))] is the value of AI policy π. Different application layers plug into the same definition:
- Assistant quality: Does π return the answer you'd endorse after deliberation?
- Planner quality: Does the plan, if followed, lead to IDO-good futures?
- Tool-use quality: Does the chain of calls (search, code, simulate) reach IDO-endorsed outcomes?
The CIMO stance
Define the right target
What you truly care about—not what's easy to measure
Use surrogates intentionally
As calibrated, testable approximations, not as ends in themselves
Make uncertainty honest
Explicitly account for calibration error (OUA)
Routinely test transportability
Refresh a small oracle slice when it drifts across policies, domains, or time
What must be true (assumptions ledger)
| Code | Statement | Test / Diagnostic | Mitigation |
|---|---|---|---|
| S1 | Surrogate sufficiency: E[Y* \| X,A,S] = f(S,X) | Incremental signal; residual vs. f | Add covariates; richer judge; higher rung |
| S2 | Transportability: f works across groups | Mean residual = 0 by policy/time | Recalibrate on target group |
| S3 | Overlap: π ≪ π₀ (for off-policy) | ESS, tail index, max/median weight ratio | Weight stabilization; collect fresh draws |
| A1–A3 | IDO well-posed (stability, monotonicity, invariance) | Rung stability checks | Clarify oracle definition |
| OUA | Finite oracle labels → calibration uncertainty | OUA share of total variance | Add labels if OUA dominates |
Why this framing is more than philosophy
A single target clarifies training and evaluation. If the goal is V(π) against IDO, then:
- Training can mix signals as long as they are calibrated surrogates for Y*. RLHF, critique models, rubrics, LLM judges—yes, but as S's to be calibrated and periodically anchored to Y or Y*, not as ersatz truths. (Reward hacking and overoptimization in RLHF: Casper et al., 2023; Gao et al., 2022)
- Evaluation must report on the IDO scale, not in surrogate units. Two systems can have identical surrogate scores but differ on Y* once bias is corrected.
- Governance gets a principled lever: choose the IDO aggregation rule for multi-party settings (who counts and how?) and make it auditable. For multi-stakeholder settings, pick a social aggregator W(·) (e.g., weighted utility or min-satisfy) and evaluate V(π) = E[W(Y*₁, …, Y*ₘ)] on the IDO scale.
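As a tiny illustration of making the aggregation rule auditable, here is a sketch of the two W choices named above; the matrix layout (rows = prompts, columns = stakeholders) is an assumption for the example, not a fixed interface.

```python
import numpy as np

def min_satisfy_value(Y_star):
    """W = min over stakeholders: V(pi) = E[min(Y*_1, ..., Y*_m)].
    Y_star is an (n_prompts, m_stakeholders) matrix of oracle scores."""
    return np.min(Y_star, axis=1).mean()

def weighted_utility_value(Y_star, weights):
    """W = weighted utility with explicit, auditable stakeholder weights."""
    w = np.asarray(weights, dtype=float)
    return (Y_star @ (w / w.sum())).mean()
```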
A unifying role for "alignment." Much of "alignment" is choosing and approximating the right oracle, not only policing behavior. If your application's IDO is "patient health after informed consent," you can't be satisfied with click-through or short-horizon satisfaction unless you can calibrate them to that outcome.
Connections to existing work
| Practice | In this framework | Failure mode you prevent |
|---|---|---|
| RLHF/RLAIF rewards | S (surrogate); calibrate to Y | Reward hacking on rubric artifacts |
| LLM-as-a-Judge | S; two-stage calibration f(S,X) | Verbosity/format bias, uncalibrated scales (Zheng et al., 2023) |
| A/B test KPIs | Y (high practical rung) | Over-indexing on short horizon metrics |
| Debate/Amplification | Higher-rung S (closer to Y*) | Weak S→Y link if insufficiently deliberated |
What this is not
We're not claiming clicks or judge scores are bad—only that they are surrogates. Without calibration to Y*, they can drift, incentivize verbosity, or miss long-run value. With calibration and transport tests, they become powerful. This framework doesn't replace metrics; it makes them honest.
See the framework in action
The Arena Experiment demonstrates this framework at scale: we treat GPT-5 as Y (a high practical rung), GPT-4.1-nano as S (cheap surrogate), and show how two-stage calibration with response_length covariates corrects verbosity bias. The post includes full transportability diagnostics (residual tests by policy), OUA variance decomposition, and demonstrates when calibration fails (adversarial policies breaking the S→Y mapping).
Ship checklist: 7 steps to deploy CJE
- 1. Define your IDO — What's the real outcome? (90-day retention, expert audit, social welfare aggregator W(·))
- 2. Pick your rung k — What's the cheap signal S you'll actually collect at scale? (judge score, click, watch-time)
- 3. Collect a small oracle slice — Gather Y (or Y*) stratified by policy/cohort/length; aim for ~500–2000 labeled pairs
- 4. Fit f(S,X) — Two-stage: index T=g(S,X) → isotonic T→Y* with cross-fitting (5-fold recommended)
- 5. Compute R = f(S,X) — Estimate V(π) with a Direct, IPS, or DR estimator (a minimal self-normalized IPS sketch follows this checklist)
- 6. OUA variance via jackknife — Delete-one-fold over oracle folds; report total CI (sampling + oracle uncertainty)
- 7. Transport tests — Residual-by-group; if biased, recalibrate on target slice or fall back to Kallus & Mao style
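For teams running the off-policy (IPS) variant of step 5, a hedged sketch is below. It assumes per-response log-probabilities under both the target and logging policies are available in your logs; the function name is illustrative, not a published CJE API.

```python
import numpy as np

def snips_estimate(logp_target, logp_behavior, R):
    """Self-normalized IPS over logged responses with calibrated rewards R = f(S, X).
    logp_* are per-response log-probabilities under the target and logging policies."""
    w = np.exp(logp_target - logp_behavior)   # importance weights pi / pi0
    ess = w.sum() ** 2 / (w ** 2).sum()       # effective sample size: the overlap diagnostic from S3
    V_hat = np.sum(w * R) / np.sum(w)         # self-normalization stabilizes heavy-tailed weights
    return V_hat, ess
```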
Limits and open questions
- Value formation. Preferences shift under deliberation. IDO therefore encodes a meta-preference (what you'd endorse after reflection). Be explicit when it's individual vs social.
- When surrogacy fails. Adversarial content can hold high surrogate scores and low IDO value. Diagnostics: policy-wise residuals, extrapolation flags, failure on hold-out audits.
- Path dependence. Process may matter morally (dignity, autonomy). If so, include it in IDO rather than pretending it's irrelevant.
- Cost realism. IDO is a regulative ideal. In practice, climb the ladder as far as budgets allow, and use calibration to bridge the rest.
- Construct inversion. When engagement harms well-being (addictive content, dark patterns), calibration will show residual bias—surrogates predict engagement but not satisfaction. That's a signal the policy objective needs revision, not a statistical failure.
Next: From theory to practice
This post laid out the why: treating AI quality as surrogacy for an idealized deliberation oracle gives you a single target, a theory of proxies, and a design language.
📐 Technical appendix available
For readers who want the formal framework—precise definitions, identification results, influence functions, and asymptotic theory—see the Technical Appendix.
References
Cite this work
APA
Landesberg, E. (2025, October 25). What Are We Really Measuring? AI Quality as Surrogacy toward an Idealized Deliberation Oracle. CIMO Labs Blog. https://cimolabs.com/blog/ai-quality-surrogacy
BibTeX
@misc{landesberg2025what,
author = {Eddie Landesberg},
title = {What Are We Really Measuring? AI Quality as Surrogacy toward an Idealized Deliberation Oracle},
howpublished = {\url{https://cimolabs.com/blog/ai-quality-surrogacy}},
year = {2025},
note = {CIMO Labs Blog}
}