
What Are We Really Measuring?

AI quality as surrogacy toward an idealized deliberation oracle

Thesis

Most AI metrics are proxies; treat them as calibrated surrogates for an Idealized Deliberation Oracle (think: the score you'd give after unlimited time and tools) and report policy quality on the oracle scale, with uncertainty that includes the cost of learning that calibration. This post lays out a statistical framework to do this rigorously, which we call Causal Judge Evaluation (CJE).

🎯 The Problem

Your team just shipped a new model. The judge scores were state-of-the-art, but real-world user satisfaction is tanking. What went wrong?

You were measuring the wrong thing. Your judge score was a bad surrogate for what users actually care about.

This post lays out a simple statistical framework to fix this. We'll show you how to define the right target, use your existing metrics as calibrated surrogates, and measure your uncertainty honestly.

📐 For theorists

This post focuses on intuition and practical framing. For rigorous definitions, identification theorems, influence functions, and asymptotic theory, see the Technical Appendix.

The target you wish you could measure

If you (or your users, or your regulators) could spend as long as needed to assess an AI system's behavior for a given task, what would the score be?

Call that random variable Y* (Y-star): an Idealized Deliberation Oracle (IDO). For a policy π (model + prompt + tools + settings), the evaluation target is the policy value V(π) = E[Y*(X, Aπ(X))], where Aπ(X) is the response π produces for prompt X.

In plain English: a policy's quality is the average IDO score you'd give to its answers on your real prompts.

Quick cheat sheet

S = fast score (cheap) • Y = slower but feasible • Y* = what you'd endorse after deep review

X = context/prompt • A = action/response • π = policy (model + settings)

Three common IDO archetypes

  • Assistant IDO: Would I endorse this answer after unlimited time and tools? (Personal assistance, code generation, research)
  • Product IDO: Does this policy increase long-run user value (e.g., 90-day satisfaction/retention) while meeting safety constraints? (Recommendations, search, content moderation)
  • Business IDO: What's the discounted lifetime value of all current and future customers affected by this policy? Operational metrics (clicks, engagement) are surrogates S; near-term conversions are Y; total discounted cash flows are the ultimate business outcome Y*.

In practice, Y* is expensive or slow (deep human evaluation, task completion with long follow-up, safety audit, long-run satisfaction). So we use cheap surrogates (S): judge scores, clicks, watch-time, short surveys, automated rubric checks that arrive immediately. This creates a ubiquitous data structure: abundant cheap signals but sparse expensive outcomes. This approach builds on decades of statistical work on surrogate endpoints.

⚠️ Scope statement

This framework solves the measurement problem given your values—it doesn't tell you what to value. If your application's IDO is "patient health after informed consent," you can't be satisfied with click-through unless you can calibrate it to that outcome. The choice of what Y* represents is normative; surrogacy is the statistical machinery to bridge cheap signals to that target. If process has intrinsic moral value (e.g., dignity constraints, procedural fairness), encode it into Y* rather than treating it as irrelevant.

Surrogacy: turning fast signals into the thing you care about

A surrogate is useful when it is predictively sufficient for the target, conditional on observed context (X):

E[Y* | X, A, S] = E[Y* | X, S] = μ(S, X)

If that holds, S is "sufficient" once you condition on context; A adds no extra predictive power beyond S and X on the oracle scale.

In practice, we calibrate S against the highest-rung measurement Y we can afford (e.g., human audits, A/B test KPIs, or a GPT-5 judge), which itself serves as our best available proxy for the true Idealized Deliberation Oracle Y*. The framework treats Y as the observable "ground truth" for learning f, while Y* remains the ultimate target we report on.

Under this assumption, you can learn a calibration function f(S,X) ≈ μ(S,X) on a small labeled slice with ground-truth Y, then apply it broadly to compute calibrated rewards R = f(S,X). Estimate V(π) by averaging R for responses produced by π, and quantify uncertainty that includes both sampling error and the fact that f was learned on a finite oracle slice (formal conditions and inference theory in the Technical Appendix).
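As a concrete illustration, here is a minimal sketch of that recipe in Python. The synthetic data, variable names, and the choice of isotonic regression are illustrative assumptions, not the CJE implementation; context covariates X and the OUA variance term are deferred to later sections.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Oracle slice: a few hundred responses with both the cheap score S and the
# expensive label Y (human audit, A/B KPI, or a stronger judge standing in for Y*).
S_labeled = rng.uniform(0, 1, 500)
Y_labeled = np.clip(0.8 * S_labeled + rng.normal(0, 0.1, 500), 0, 1)

# Evaluation log: cheap scores only, at scale, for responses produced by policy pi.
S_eval = rng.uniform(0, 1, 50_000)

# Step 1: learn the calibration f on the small labeled slice
# (isotonic regression keeps f monotone in the surrogate score).
f = IsotonicRegression(out_of_bounds="clip").fit(S_labeled, Y_labeled)

# Step 2: map every cheap score onto the oracle scale -> calibrated rewards R.
R = f.predict(S_eval)

# Step 3: the point estimate of V(pi) is the mean calibrated reward.
V_hat = R.mean()
se_sampling = R.std(ddof=1) / np.sqrt(len(R))  # sampling term only; OUA term comes later
print(f"V(pi) ~ {V_hat:.3f} +/- {1.96 * se_sampling:.3f} (sampling uncertainty only)")
```

The interval printed here is deliberately incomplete: it covers only sampling noise, which is why the OUA term discussed below matters.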

Notation at a glance

Y*: Idealized Oracle (the target we care about) • Y: Highest practical rung (human audit, A/B KPI, GPT-5) • S: Cheap surrogate (judge score, click, watch-time) • f(S,X): Calibration function • R = f(S,X) on Y-scale • V(π) = E[R|π] (policy value estimate)

The Deliberation Ladder

  • k=0 – Instantaneous proxy [S]: passive telemetry (clicks, dwell, autoplay watch-time); zero rater effort
  • k=1 – Quick rating [S]: ~30s human tap or LLM judge on a short rubric; single pass, no evidence
  • k=2 – Guided review [S↗]: 5–10 min rubric-driven review with retrieval/citations; checks for claims & references
  • k=3 – Adjudicated audit [S↗]: multi-rater with tie-break/adjudication, evidence verification, escalation to SME
  • k=4 – Causal field outcome (A/B) [Y]: prospective impact on real tasks/metrics (e.g., 14–90d satisfaction, task success)
  • k→∞ – IDO (Idealized Deliberation Oracle): unlimited time/tools; normative aggregation of constraints & stakeholder weights

CJE's job: for each rung k, fit a calibration fk: S(k) → Y* (monotone in the surrogate score, not across k). A practical recipe is two-stage calibration: first build a bias-corrected index T = g(S, X) (e.g., remove verbosity/length/style effects), then fit isotonic T → Y* on a modest higher-k slice. Treat k=4 as a causal Y label: it's closer to Y*, but still a surrogate unless your IDO equals that field metric.
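A minimal sketch of one way to implement that two-stage recipe, assuming response length is the only style covariate and using a linear first stage; the names (`fit_two_stage`, the log-length feature) are illustrative, and a production version would add cross-fitting and richer covariates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.isotonic import IsotonicRegression

def fit_two_stage(S, length, Y):
    """Two-stage calibration sketch.

    Stage 1: build a bias-corrected index T = g(S, X) by letting the higher-rung
    label Y decide how much of the raw judge score is a length/verbosity artifact.
    Stage 2: fit a monotone (isotonic) map from the index T to the label scale.
    """
    X_cov = np.column_stack([S, np.log1p(length)])
    g = LinearRegression().fit(X_cov, Y)          # index model: T = g(S, X)
    T = g.predict(X_cov)
    iso = IsotonicRegression(out_of_bounds="clip").fit(T, Y)

    def f(S_new, length_new):
        """Calibrated reward R = f(S, X) on the label scale."""
        T_new = g.predict(np.column_stack([S_new, np.log1p(length_new)]))
        return iso.predict(T_new)

    return f

# Usage: fit on the oracle slice, then score the full evaluation log.
# f = fit_two_stage(S_oracle, len_oracle, Y_oracle)
# R = f(S_eval, len_eval)
```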

This pattern in practice

Sophisticated teams already operate this way, even without formal IDO terminology:

  • Netflix: Learns proxy metrics from past experiments to predict long-term member satisfaction (Netflix Tech Blog)
  • YouTube: Shifted from clicks → watch time → survey-calibrated "valued watch time" (YouTube Blog)
  • Amazon: Measured 97% directional agreement between offline ranking metrics and online business metrics (Amazon Science)

CJE formalizes this intuition with calibration functions, transportability diagnostics, and honest uncertainty.

Transportability: when does the mapping travel?

Even if f fits last month's data well, it can fail when contexts change. When does the calibration function f(S,X) transport across:

  • Policy: New model/prompt changes style or length effects
  • Population: New user cohorts, markets, languages
  • Time: Preference drift, product changes

Concrete drift example

We trained on English Q&A with GPT-4.1-nano as judge; now we're evaluating a longer-form coding assistant with a new judge (4.5-nano). Expect length/style to shift the S→Y* mapping; run residual-by-policy tests and, if biased, recalibrate on a small target slice.

So treat "calibration learned on slice A, applied to slice B" as a claim to test. Diagnostics (the formal causal theory is in the Technical Appendix §3.5):

🔍 Transport Diagnostics

  • Residual by stratum: Is E[Y* − f(S,X) | policy or cohort] = 0?

    → If residuals ≠ 0: recalibrate on a small target-slice or add covariates to f.

  • Coverage: Do 95% CIs hit the truth ~95% of the time on hold-out oracle labels?

    → If coverage off: collect more oracle labels at under-covered strata.

  • Stratified calibration plots: Across domains/lengths/time

    → If stratified drift: test per-group S1/S2; fall back to Kallus & Mao-style estimation or collect target labels.

✓ Green light rule

Don't promote a model based on surrogates unless residual-by-stratum means are ~0 on a fresh oracle slice. Plot E[Y* - f(S,X) | policy/cohort] with confidence bands; if systematic bias appears (bands don't overlap zero), recalibrate before shipping.
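A sketch of that green-light check, assuming you hold out a fresh oracle slice with labels `Y_star`, calibrated rewards `R`, and a stratum label (policy or cohort) per row; the normal-approximation band is an illustrative choice.

```python
import numpy as np

def residual_by_stratum(Y_star, R, strata, z=1.96):
    """Per-stratum mean of the calibration residual (Y* - f(S,X)) with a normal
    95% band. A band that excludes zero flags systematic bias: recalibrate on
    that stratum before trusting surrogate-based promotion decisions."""
    Y_star, R, strata = map(np.asarray, (Y_star, R, strata))
    report = {}
    for g in np.unique(strata):
        resid = Y_star[strata == g] - R[strata == g]
        mean = resid.mean()
        half = z * resid.std(ddof=1) / np.sqrt(len(resid))
        report[g] = {"mean": mean, "lo": mean - half, "hi": mean + half,
                     "flag": not (mean - half <= 0.0 <= mean + half)}
    return report

# for g, r in residual_by_stratum(Y_hold, R_hold, policy_hold).items():
#     print(f"{g}: {r['mean']:+.3f} [{r['lo']:+.3f}, {r['hi']:+.3f}]",
#           "RECALIBRATE" if r["flag"] else "ok")
```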

A ladder of surrogates (and why it matters)

Often you have multiple cheap signals S1, …, SK: clicks, watch-time, short satisfaction taps, rubric scores, complaint rate, near-term retention.

Calibrate each—or build a surrogate index that best predicts a longer-run target (e.g., 90-day satisfaction/retention or human audit) by combining multiple short-run signals.
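One common way to build such an index, sketched under the assumption that you have a slice where the long-run target is observed alongside the short-run signals; the ridge model here is an illustrative choice, in the spirit of the surrogate-index literature [3].

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_surrogate_index(short_run_signals, long_run_target):
    """Combine short-run signals S1..SK (columns: clicks, watch-time, taps,
    rubric scores, ...) into the single index that best predicts the long-run
    target (e.g., 90-day satisfaction/retention) on the slice where that
    target is observed."""
    alphas = np.logspace(-3, 3, 13)
    return RidgeCV(alphas=alphas).fit(short_run_signals, long_run_target)

# index = fit_surrogate_index(S_matrix_labeled, y90_labeled).predict(S_matrix_full)
```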

Uncertainty from both steps

When you calibrate a surrogate to predict an oracle, your uncertainty comes from two sources:

1. Sampling variability

Noise in the evaluation log—how many prompts you scored, natural variation in responses

2. Learning variability (OUA)

Oracle-Uncertainty-Aware variance: uncertainty from fitting f on a finite oracle slice. Treating f as known systematically understates total uncertainty.

In practice, refit f on jackknifed folds of the oracle slice and add that "calibration variance" to the base sampling variance—accounting for how errors in f propagate to errors in V(π). This two-part uncertainty is crucial: the Oracle-Uncertainty-Aware (OUA) principle ensures you report honest confidence intervals that reflect both sampling noise and calibration error. Note: OUA variance is zero at 100% oracle coverage because f is not learned—you're directly observing the target label (Y* or your chosen high-rung Y).
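A sketch of the delete-one-fold jackknife for the OUA term, assuming the oracle slice is already split into folds and `fit_f` is any routine that returns a calibration function (both names are illustrative).

```python
import numpy as np

def oua_jackknife_se(oracle_folds, S_eval, fit_f):
    """Delete-one-fold jackknife over the oracle slice.

    Refit the calibration with each oracle fold left out, recompute V(pi) on the
    evaluation log, and convert the spread of those estimates into the
    'calibration variance' that gets added to the base sampling variance."""
    K = len(oracle_folds)
    V_loo = np.empty(K)
    for k in range(K):
        S_tr = np.concatenate([S for j, (S, Y) in enumerate(oracle_folds) if j != k])
        Y_tr = np.concatenate([Y for j, (S, Y) in enumerate(oracle_folds) if j != k])
        f_minus_k = fit_f(S_tr, Y_tr)          # calibration learned without fold k
        V_loo[k] = np.mean(f_minus_k(S_eval))  # policy value under that calibration
    var_oua = (K - 1) / K * np.sum((V_loo - V_loo.mean()) ** 2)
    return np.sqrt(var_oua)

# Total interval: V_hat +/- 1.96 * sqrt(se_sampling**2 + oua_se**2)
```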

Quality as a policy value: what we actually optimize

Given the oracle, V(π) = E[Y*(X, Aπ(X))] is the value of AI policy π. Different application layers plug into the same definition:

  • Assistant quality: Does π return the answer you'd endorse after deliberation?
  • Planner quality: Does the plan, if followed, lead to IDO-good futures?
  • Tool-use quality: Does the chain of calls (search, code, simulate) reach IDO-endorsed outcomes?

The CIMO stance

  1. Define the right target: what you truly care about, not what's easy to measure
  2. Use surrogates intentionally: as calibrated, testable approximations, not as ends in themselves
  3. Make uncertainty honest: explicitly account for calibration error (OUA)
  4. Routinely test transportability: refresh a small oracle slice when it drifts across policies, domains, or time

What must be true (assumptions ledger)

Assumptions ledger for surrogacy-based AI evaluation
  • S1 – Surrogate sufficiency: E[Y* | X,A,S] = f(S,X). Test: incremental signal; residual vs. f. Mitigation: add covariates; richer judge; higher rung.
  • S2 – Transportability: f works across groups. Test: mean residual = 0 by policy/time. Mitigation: recalibrate on the target group.
  • S3 – Overlap: π ≪ π0 (for off-policy use). Test: ESS, tail index, max/median weight (sketched below). Mitigation: weight stabilization; collect fresh draws.
  • A1–A3 – IDO well-posed (stability, monotonicity, invariance). Test: rung stability checks. Mitigation: clarify the oracle definition.
  • OUA – Finite oracle labels → calibration uncertainty. Test: OUA share of total variance. Mitigation: add labels if OUA dominates.
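Following up on the S3 row: a minimal sketch of those overlap diagnostics, assuming you can evaluate the log-propensity of each logged response under both the target policy π and the logging policy π0 (function and field names are illustrative).

```python
import numpy as np

def overlap_diagnostics(logp_target, logp_logging):
    """Importance-weight health checks for off-policy estimation (assumption S3).

    Red flags: effective sample size collapsing to a small fraction of n, or a
    handful of weights dominating the sum (heavy right tail)."""
    w = np.exp(np.asarray(logp_target) - np.asarray(logp_logging))  # pi / pi_0
    w_norm = w / w.mean()
    return {
        "ess_fraction": float(1.0 / np.mean(w_norm ** 2)),          # ESS / n
        "max_over_median": float(w.max() / np.median(w)),
        "top1pct_weight_share": float(np.sort(w)[-max(1, len(w) // 100):].sum() / w.sum()),
    }
```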

Why this framing is more than philosophy

A single target clarifies training and evaluation. If the goal is V(π) against IDO, then:

  • Training can mix signals as long as they are calibrated surrogates for Y*. RLHF, critique models, rubrics, LLM judges—yes, but as S's to be calibrated and periodically anchored to Y or Y*, not as ersatz truths. (Reward hacking and overoptimization in RLHF: Casper et al., 2023; Gao et al., 2022)
  • Evaluation must report on the IDO scale, not in surrogate units. Two systems can have identical surrogate scores but differ on Y* once bias is corrected.
  • Governance gets a principled lever: choose the IDO aggregation rule for multi-party settings (who counts and how?) and make it auditable. In multi-stakeholder settings, pick a social aggregator W(·) (e.g., weighted utility or min-satisfy) and evaluate V(π) = E[W(Y*₁, …, Y*ₘ)] on the IDO scale; a minimal sketch of two such aggregators follows this list.
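A minimal sketch of two such aggregators; the function names and the per-stakeholder thresholds are illustrative.

```python
import numpy as np

def weighted_utility(Y_star_members, weights):
    """W(y_1..y_m) = sum_i w_i * y_i over stakeholders, per response."""
    return np.average(Y_star_members, weights=weights, axis=-1)

def min_satisfy(Y_star_members, thresholds):
    """Worst margin over per-stakeholder thresholds; >= 0 means every
    stakeholder's constraint is met for that response."""
    return np.min(np.asarray(Y_star_members) - np.asarray(thresholds), axis=-1)

# V(pi) on the IDO scale is then the average of W(...) over pi's responses:
# V_hat = weighted_utility(Y_star_matrix, weights).mean()
```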

A unifying role for "alignment." Much of "alignment" is choosing and approximating the right oracle, not only policing behavior. If your application's IDO is "patient health after informed consent," you can't be satisfied with click-through or short-horizon satisfaction unless you can calibrate them to that outcome.

Connections to existing work

  • RLHF/RLAIF rewards → S (surrogate); calibrate to Y. Failure mode prevented: reward hacking on rubric artifacts.
  • LLM-as-a-Judge → S; two-stage calibration f(S,X). Failure mode prevented: verbosity/format bias, uncalibrated scales (Zheng et al., 2023).
  • A/B test KPIs → Y (high practical rung). Failure mode prevented: over-indexing on short-horizon metrics.
  • Debate/Amplification → higher-rung S (closer to Y*). Failure mode prevented: weak S→Y link if insufficiently deliberated.

What this is not

We're not claiming clicks or judge scores are bad—only that they are surrogates. Without calibration to Y*, they can drift, incentivize verbosity, or miss long-run value. With calibration and transport tests, they become powerful. This framework doesn't replace metrics; it makes them honest.

See the framework in action

The Arena Experiment demonstrates this framework at scale: we treat GPT-5 as Y (a high practical rung), GPT-4.1-nano as S (cheap surrogate), and show how two-stage calibration with response_length covariates corrects verbosity bias. The post includes full transportability diagnostics (residual tests by policy), OUA variance decomposition, and demonstrates when calibration fails (adversarial policies breaking the S→Y mapping).

Ship checklist: 7 steps to deploy CJE

  1. Define your IDO — What's the real outcome? (90-day retention, expert audit, social welfare aggregator W(·))
  2. Pick your rung k — What's the cheap signal S you'll actually collect at scale? (judge score, click, watch-time)
  3. Collect a small oracle slice — Gather Y (or Y*) stratified by policy/cohort/length; aim for ~500–2000 labeled pairs
  4. Fit f(S,X) — Two-stage: index T=g(S,X) → isotonic T→Y* with cross-fitting (5-fold recommended)
  5. Compute R=f(S,X) — Estimate V(π) with a Direct, IPS, or DR estimator
  6. OUA variance via jackknife — Delete-one-fold over oracle folds; report the total CI (sampling + oracle uncertainty)
  7. Transport tests — Residual-by-group; if biased, recalibrate on a target slice or fall back to Kallus & Mao-style estimation (a compact end-to-end sketch of steps 3–7 follows this list)
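The end-to-end sketch referenced above, self-contained on synthetic data: the data-generating process, the fold count, and the isotonic calibration are illustrative assumptions, and only the direct (on-policy) estimator from step 5 is shown.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(7)

# Step 3: a small oracle slice (S, Y) plus a large evaluation log of cheap scores S.
n_oracle, n_eval, K = 1000, 50_000, 5
S_oracle = rng.uniform(0, 1, n_oracle)
Y_oracle = np.clip(0.7 * S_oracle**1.5 + rng.normal(0, 0.08, n_oracle), 0, 1)
S_eval = rng.uniform(0, 1, n_eval)

def fit_f(S, Y):
    """Calibration fit returning a callable f(S) on the label scale."""
    return IsotonicRegression(out_of_bounds="clip").fit(S, Y).predict

# Step 4: fit the calibration on the full oracle slice.
f = fit_f(S_oracle, Y_oracle)

# Step 5: calibrated rewards and the direct estimate of V(pi).
R = f(S_eval)
V_hat = R.mean()
se_sampling = R.std(ddof=1) / np.sqrt(n_eval)

# Step 6: OUA variance via delete-one-fold jackknife over the oracle slice.
folds = np.array_split(rng.permutation(n_oracle), K)
V_loo = np.array([
    fit_f(np.delete(S_oracle, idx), np.delete(Y_oracle, idx))(S_eval).mean()
    for idx in folds
])
se_oua = np.sqrt((K - 1) / K * np.sum((V_loo - V_loo.mean()) ** 2))

se_total = np.sqrt(se_sampling**2 + se_oua**2)
print(f"V(pi) = {V_hat:.3f} +/- {1.96 * se_total:.3f} (sampling + OUA)")

# Step 7: transport test sketch -- on a fresh oracle slice from the target policy,
# check that the mean residual Y - f(S) is statistically indistinguishable from 0
# within each policy/cohort stratum before promoting on surrogate scores alone.
```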

Limits and open questions

  • Value formation. Preferences shift under deliberation. IDO therefore encodes a meta-preference (what you'd endorse after reflection). Be explicit when it's individual vs social.
  • When surrogacy fails. Adversarial content can hold high surrogate scores and low IDO value. Diagnostics: policy-wise residuals, extrapolation flags, failure on hold-out audits.
  • Path dependence. Process may matter morally (dignity, autonomy). If so, include it in IDO rather than pretending it's irrelevant.
  • Cost realism. IDO is a regulative ideal. In practice, climb the ladder as far as budgets allow, and use calibration to bridge the rest.
  • Construct inversion. When engagement harms well-being (addictive content, dark patterns), calibration will show residual bias—surrogates predict engagement but not satisfaction. That's a signal the policy objective needs revision, not a statistical failure.

Next: From theory to practice

This post laid out the why: treating AI quality as surrogacy for an idealized deliberation oracle gives you a single target, a theory of proxies, and a design language.

📐 Technical appendix available

For readers who want the formal framework—precise definitions, identification results, influence functions, and asymptotic theory—see the Technical Appendix.

References

[1] Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine, 8(4), 431–440. DOI
[2] Buyse, M., Molenberghs, G., Burzykowski, T., Renard, D., & Geys, H. (2000). The validation of surrogate endpoints in meta-analyses of randomized experiments. Biostatistics, 1(1), 49–67. DOI
[3] Athey, S., Chetty, R., Imbens, G. W., & Kang, H. (2019). The Surrogate Index: Combining Short-Term Proxies to Estimate Long-Term Treatment Effects More Rapidly and Precisely. NBER Working Paper No. 26463. NBER
[4] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. DOI — The foundational paper on Double/Debiased Machine Learning (DML), providing the general statistical recipe (using cross-fitting and doubly-robust scores) to get valid, "honest" uncertainty when plugging in flexible ML models for nuisance components. This is the core machinery that makes the OUA principle statistically rigorous.
[5] Kallus, N., & Mao, X. (2024). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv:2003.12408. arXiv — Provides the semiparametric efficiency theory for estimating treatment effects with abundant cheap surrogates (S) and sparse expensive outcomes (Y). Their doubly-robust framework formalizes how to combine flexible machine learning for calibration with valid uncertainty quantification, directly grounding CJE's Oracle-Uncertainty-Aware (OUA) principle.
[6] Pearl, J., & Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29(4), 579–595. DOI — Provides the formal causal framework for transportability—understanding when predictive models or causal effects learned in one setting (population, time, policy) can be applied to another. Grounds the transportability diagnostics and assumptions (S2 in the assumptions ledger) that are critical for validating whether calibration functions transport across contexts.
[7] Casper, S., et al. (2023). Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv:2307.15217. arXiv — Comprehensive survey of reward hacking, misalignment, and fundamental challenges in RLHF when learned reward models are used as surrogates for human preferences.
[8] Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. arXiv:2210.10760. arXiv — Empirically demonstrates how optimizing too hard against a learned reward model leads to degradation in true human preference alignment (reward overoptimization), quantifying the surrogacy gap in RLHF.
[9] Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685. arXiv — Documents known biases in LLM judges including verbosity bias, position bias, and self-enhancement bias; introduces MT-Bench for evaluating judge quality.
[10] Dubois, Y., et al. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv:2404.04475. arXiv — Demonstrates that LLM judges exhibit strong verbosity bias and shows that length-controlled evaluation (using response length as a covariate) substantially improves reliability and reduces bias.
[11] Irving, G., Christiano, P., & Amodei, D. (2018). AI Safety via Debate. arXiv:1805.00899. arXiv — Proposes debate as a scalable oversight mechanism where two agents argue opposing positions to help humans make better judgments on complex questions; a higher-rung surrogate on the deliberation ladder.
[12] Christiano, P., et al. (2018). Supervising Strong Learners by Amplifying Weak Experts. arXiv:1810.08575. arXiv — Introduces iterated amplification, where weak human judgments are recursively amplified through AI assistance to approximate idealized expert judgment; another approach to stronger surrogates.
[13] YouTube Engineering Blog (2012). "YouTube Now: Why We Focus on Watch Time." Link

Cite this work

APA

Eddie Landesberg. (2025, October 25). What Are We Really Measuring? AI Quality as Surrogacy toward an Idealized Deliberation Oracle. CIMO Labs Blog. https://cimolabs.com/blog/ai-quality-surrogacy

BibTeX

@misc{landesberg2025what,
  author = {Eddie Landesberg},
  title = {What Are We Really Measuring? AI Quality as Surrogacy toward an Idealized Deliberation Oracle},
  howpublished = {\url{https://cimolabs.com/blog/ai-quality-surrogacy}},
  year = {2025},
  note = {CIMO Labs Blog}
}