Programmable Proxies: Designing, Calibrating, and Auditing LLM Judges as Surrogate Outcomes
Technical Appendix
Abstract. We model LLM judges as programmable surrogates parameterized by a prompt/rubric θ. We propose a closed-loop procedure that calibrates judge scores to a small gold slice, audits residuals to detect systematic misscoring, and updates the prompt to reduce error. Under mild conditions, enriching the rubric weakly reduces Bayes risk (via Blackwell/Doob information ordering), and doubly-robust policy evaluation with calibrated rewards admits valid, oracle-uncertainty-aware inference. We provide transport diagnostics, anti-gaming tests, and practical sample-size guidance. Experiments across text-assistant tasks show that two prompt updates reduce calibration RMSE by 20–40% (two assistant tasks, n≈10k scored / 1.5k gold each; 95% CI: 18–43%) and preserve validity under adversarial padding, while legacy business proxies remain brittle. Unlike clicks or engagement metrics, LLM judges are repairable instruments—when they drift, you update the rubric, recalibrate, and verify with residuals.
0. Notation & Setup
Primitives
- X, A: Context (prompt) and action (response/policy)
- Y: Target outcome (expert audit, IDO realization, or user satisfaction)
- θ: Judge prompt/rubric specification
- J_θ: Judge function parameterized by θ
- S_θ = J_θ(X, A): Raw judge score (may include sub-rubrics)
- f: Calibrator function
- R = f(S_θ, X): Calibrated reward
Estimand
Policy value under target outcome Y:

V(π) = 𝔼_X 𝔼_{A∼π(·|X)}[Y]
Data
- Logging data: {(X_i, A_i, S_{θ,i})} from behavior policy π₀
- Oracle slice: a subsample with labeled Y values
- Oracle coverage: the fraction of logged examples with oracle labels (typically 10–25%)
1. Endogenous Surrogate Theory
Key insight: Unlike legacy business proxies (clicks, sign-ups) which are exogenous signals, LLM judges are endogenous—their measurement channel is programmable via the rubric θ.
Assumption J1 (Programmable measurement channel)
The judge is parameterized by a prompt/rubric θ that specifies: (i) which evidence to consider, (ii) what weight to assign different quality dimensions, and (iii) what violations trigger penalties. The evidence set E_θ induces a σ-field σ(E_θ).
Lemma 1 (Informativeness monotonicity)
Let σ(E_θ) denote the σ-field induced by the evidence the judge must consider under rubric θ. If σ(E_θ) ⊆ σ(E_θ′) (the refined rubric requires more evidence), then:

𝔼[(Y − 𝔼[Y | σ(E_θ′)])²] ≤ 𝔼[(Y − 𝔼[Y | σ(E_θ)])²]

That is, enriching the rubric weakly reduces Bayes risk for predicting Y.
Argument sketch: Direct from the L²-projection (tower) property of conditional expectation. For any refinement of σ-fields F ⊆ G, the conditional expectation 𝔼[Y | G] is the best L² predictor given the larger information set, hence weakly better than 𝔼[Y | F]. See Blackwell (1951) [1] for the information ordering result. □
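As a numeric sanity check of Lemma 1, the toy simulation below (the data-generating model and all names are illustrative, not from the paper) compares the Bayes risk of a judge conditioning on one evidence bit versus two:

```python
# Toy check of Lemma 1: refining the evidence set the judge conditions on
# weakly lowers L2 risk. Hypothetical model: Y depends on two binary
# evidence bits; a coarse rubric sees only bit 1, a refined rubric sees both.
import random

random.seed(0)

data = [(e1, e2, 2.0 * e1 + 1.0 * e2 + random.gauss(0, 0.1))
        for _ in range(20000)
        for e1 in [random.randint(0, 1)]
        for e2 in [random.randint(0, 1)]]

def cond_mean(key):
    """Empirical E[Y | key(evidence)] by per-cell averaging."""
    sums, counts = {}, {}
    for e1, e2, y in data:
        k = key(e1, e2)
        sums[k] = sums.get(k, 0.0) + y
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def mse(key):
    mu = cond_mean(key)
    return sum((y - mu[key(e1, e2)]) ** 2 for e1, e2, y in data) / len(data)

mse_coarse = mse(lambda e1, e2: e1)        # sigma(E_theta): bit 1 only
mse_fine = mse(lambda e1, e2: (e1, e2))    # sigma(E_theta'): both bits

assert mse_fine <= mse_coarse  # enrichment weakly reduces Bayes risk
```

Here the coarse judge's risk includes the full variance of the second evidence bit, so the gap is large; in general the lemma only guarantees a weak inequality.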
Practical implication: When residual analysis reveals that the judge is missing signal (e.g., rewarding verbosity, missing hallucinations), we can update θ to require explicit evidence checks, then verify that calibration error decreases.
Assumption J1b (Complexity gap)
Evaluating a candidate answer against a rubric is a simpler task than generating the answer from scratch. Formally, let C_gen(X) denote the computational complexity of producing action A from context X, and C_judge(X, A, θ) the complexity of scoring (X, A) against rubric θ. We assume:

C_judge(X, A, θ) ≪ C_gen(X)
This asymmetry enables using smaller, cheaper models for judging than for generation, and spending heavy compute only where the judge's uncertainty is high (e.g., via abstain policies or ensembles).
Assumption J2 (Surrogacy sufficiency on support)
For (x, a) on the relevant support, there exists f such that:

𝔼[Y | X, A, S_θ] = f(S_θ, X)

This is the local surrogacy condition (Regime 2 in the surrogacy technical appendix). It states that S_θ (together with X) captures all the information about Y in (X, A), on the relevant support.
Assumption J3 (Transport/S-admissibility)
The calibrator f transports across evaluation environments (policies, time periods, user cohorts) when:

Y ⊥ Sel | X, A, S_θ

where Sel represents selection nodes (policy choice, time period, judge pool). This is the S-admissibility condition from surrogate theory. See §4 for transport diagnostics.
1.5 Why Calibration Is Optimal: The Pathwise Derivative
We now show why calibration f(S) = 𝔼[Y|S] is not merely useful—it is variance-optimal for estimating policy value. This section connects calibration to semiparametric efficiency theory, showing that f(S) is the Efficient Influence Function (EIF) for the welfare functional.
The Setup: Welfare as a Functional
Our target estimand is the expected welfare under policy πθ:

ψ(πθ) = 𝔼_{X, A∼πθ}[Y]
We want to understand how ψ changes as we vary the policy parameters θ. Specifically, we seek the pathwise derivative—the gradient of ψ with respect to θ along any smooth path in the policy space.
The Decomposition
Any welfare outcome Y can be decomposed into two orthogonal components relative to the surrogate score S:

Y = f(S) + ε,  where f(S) = 𝔼[Y|S] and ε := Y − f(S)
This decomposition has three critical properties:
Lemma 2 (Orthogonal Decomposition Properties)
- Uniqueness: The decomposition Y = f(S) + ε with f(S) = 𝔼[Y|S] and 𝔼[ε|S] = 0 is unique (follows from the definition of conditional expectation as the L²-projection).
- Orthogonality: For any measurable function g(S), we have 𝔼[ε · g(S)] = 0. The noise ε is orthogonal to the entire space of functions of S.
- Variance Decomposition: Var(Y) = Var(f(S)) + Var(ε), with no covariance term because 𝔼[f(S) · ε] = 0.
Argument sketch: (1) follows from the projection theorem in Hilbert space. (2) follows by iterated expectation: 𝔼[ε · g(S)] = 𝔼[𝔼[ε|S] · g(S)] = 0. (3) follows from (2) by taking g(S) = f(S). □
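The three properties can be verified numerically. The sketch below (toy model, illustrative only) estimates f(S) = 𝔼[Y|S] by per-level averaging and checks orthogonality and the variance decomposition:

```python
# Numeric check of Lemma 2 on a toy discrete model: with f(S) = E[Y|S],
# the residual eps is orthogonal to any g(S) and Var(Y) = Var(f(S)) + Var(eps).
import random
import statistics

random.seed(1)
S = [random.randint(0, 3) for _ in range(50000)]
Y = [s + random.gauss(0, 1) for s in S]          # Y = S + noise

# f(S) = E[Y|S], estimated by per-level averaging
levels = sorted(set(S))
f = {s: statistics.fmean(y for s2, y in zip(S, Y) if s2 == s) for s in levels}
fS = [f[s] for s in S]
eps = [y - fs for y, fs in zip(Y, fS)]

def var(xs):
    m = statistics.fmean(xs)
    return statistics.fmean((x - m) ** 2 for x in xs)

# (2) Orthogonality: E[eps * g(S)] ~ 0 for an arbitrary g(s) = s**2
ortho = statistics.fmean(e * s ** 2 for e, s in zip(eps, S))
# (3) Variance decomposition: Var(Y) - (Var(f(S)) + Var(eps)) ~ 0
gap = var(Y) - (var(fS) + var(eps))

assert abs(ortho) < 0.1 and abs(gap) < 0.05
```

Because f is the empirical per-level mean, the residuals sum to zero within each level, so both checks hold essentially exactly, not just asymptotically.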
The Pathwise Derivative
Consider a smooth path through policy space: πθ where θ ∈ ℝ. The pathwise derivative of the welfare functional is:

dψ/dθ = d/dθ 𝔼_{πθ}[Y]

Substituting the decomposition Y = f(S) + ε:

dψ/dθ = d/dθ 𝔼_{πθ}[f(S)] + d/dθ 𝔼_{πθ}[ε]

The second term vanishes. Because ε is orthogonal to S (which is a function of πθ), any change in θ that affects S does not affect the expected value of ε:

d/dθ 𝔼_{πθ}[ε] = 0

Therefore, the pathwise derivative simplifies to:

dψ/dθ = d/dθ 𝔼_{πθ}[f(S)]
Result 1 (Calibration as the EIF)
The calibration function f(S) = 𝔼[Y|S] is the Efficient Influence Function for the welfare functional ψ(π) = 𝔼[Y]. Any other estimator of ψ using the surrogate S has variance ≥ Var(f(S)) or is asymptotically equivalent.
Argument sketch: By the semiparametric efficiency bound (Bickel et al., 1993), the EIF for a functional is the unique element in the tangent space that satisfies the orthogonality condition. We have shown that f(S) satisfies this condition: the "nuisance" component ε is orthogonal to all score operators in the tangent space generated by S. Therefore f(S) is the Riesz representer for ψ, which is the EIF. See van der Laan & Robins (2003) for the general framework. □
Geometric Interpretation
What does this mean geometrically? The space of all possible outcomes Y forms a Hilbert space. The surrogate S generates a tangent space—the subspace of outcomes that can be predicted from S.
- f(S) = 𝔼[Y|S] is the orthogonal projection of Y onto this tangent space.
- ε = Y - f(S) is the residual—the component of Y orthogonal to the tangent space.
- When we optimize, ∇S (the raw gradient) points in a direction that includes both the tangent space and orthogonal directions (exploitation).
- ∇f(S) (the calibrated gradient) points only in directions within the tangent space—directions that actually move toward higher expected welfare.
This is why calibration removes reward hacking: hacking vectors live in the orthogonal complement (the ε space), and calibration projects them out.
Connection to Regime 4
This derivation applies to Regime 3 (Calibration) in the surrogacy taxonomy. In Regime 4 (Optimization), we additionally require that the policy optimization process remains on the tangent space. This is where Standard Deliberation Protocols (SDPs) enter: they constrain the action space to ensure that optimization steps cannot exploit side channels outside the tangent space defined by f(S).
See Y*-Aligned Systems (Technical) for the full treatment of Regime 4 and the role of SDPs in maintaining alignment during optimization.
Practical Implications
Why This Guarantees Optimality
- No estimator can do better: The EIF achieves the semiparametric efficiency bound. Any other estimator using S has asymptotic variance ≥ Var(f(S)).
- Calibration is necessary: Using raw scores S is suboptimal because ∇S includes directions orthogonal to welfare (the ε component).
- The method is unique: f(S) = 𝔼[Y|S] is the unique L²-optimal predictor. Any other calibration function g(S) ≠ 𝔼[Y|S] has higher mean squared error.
This is the theoretical foundation for CJE. When we calibrate judges in practice (Section 3), we are approximating the EIF using a finite gold sample. The doubly-robust estimator (Section 2) combines this EIF approximation with importance weighting to achieve valid inference even under model misspecification.
For geometric intuition, see The Geometry of Goodhart's Law for the manifold interpretation of this result.
2. Policy Evaluation with Calibrated Rewards
2.1 Doubly Robust Estimator
Given calibrated reward R_i = f(S_{θ,i}, X_i), the doubly robust estimator is:

V̂_DR(π) = (1/n) Σ_i [ ĝ(X_i; π) + W_i (R_i − q̂(X_i, A_i)) ]

where:
- W_i = π(A_i | X_i) / π₀(A_i | X_i): importance weight
- q̂(x, a): outcome regression model
- ĝ(x; π) = 𝔼_{A∼π(·|x)}[q̂(x, A)]: policy value function
Proposition 1 (DR identification with calibrated rewards)
Under Assumption J2 (surrogacy sufficiency) and standard overlap (π₀(a|x) > 0 whenever π(a|x) > 0), the doubly robust estimator with R = f(S_θ, X) satisfies:

V̂_DR(π) →p V(π)

as n → ∞, provided either the outcome model or the propensity model is consistent.
Proof: Standard DR identification argument. See Dudík et al. (2014) [2] for general DR theory. The key is that R = f(S_θ, X) plays the role of Y in the standard setup, and Assumption J2 ensures 𝔼[R | X, A] = 𝔼[Y | X, A] on support. □
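A minimal sketch of the estimator on a toy logging setup (the policies, the reward model, and the deliberately misspecified outcome regression `q_hat` are all hypothetical, chosen so the true value is known in closed form):

```python
# DR estimator sketch: V_DR = mean_i [ g_hat(X_i; pi) + W_i * (R_i - q_hat(X_i, A_i)) ].
# Toy setup (illustrative): uniform behavior policy, target policy preferring
# action 1, calibrated reward R with mean 0.5*X + 0.5*A.
import random

random.seed(2)
actions = [0, 1]

def pi0(a, x):           # behavior policy: uniform over two actions
    return 0.5

def pi(a, x):            # target policy: prefers action 1
    return 0.75 if a == 1 else 0.25

def q_hat(x, a):         # outcome-regression model (deliberately misspecified)
    return 0.5 * x + 0.4 * a

def true_reward(x, a):   # mean of the calibrated reward in this toy world
    return 0.5 * x + 0.5 * a

logs = []
for _ in range(20000):
    x = random.random()
    a = random.choices(actions, weights=[pi0(b, x) for b in actions])[0]
    r = true_reward(x, a) + random.gauss(0, 0.1)
    logs.append((x, a, r))

def g_hat(x):            # policy value function under q_hat
    return sum(pi(a, x) * q_hat(x, a) for a in actions)

v_dr = sum(
    g_hat(x) + (pi(a, x) / pi0(a, x)) * (r - q_hat(x, a))
    for x, a, r in logs
) / len(logs)

v_true = 0.5 * 0.5 + 0.5 * 0.75   # E[0.5 X] + 0.5 * P_pi(A = 1) = 0.625
assert abs(v_dr - v_true) < 0.02  # correct despite the misspecified q_hat
```

The importance-weighted correction term repairs the bias of the misspecified `q_hat`, which is exactly the double-robustness property Proposition 1 invokes.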
2.2 Oracle-Uncertainty-Aware (OUA) Variance
Standard DR inference accounts for evaluation uncertainty (sampling from the population). But there's a second source: calibration uncertainty from learning on a finite oracle slice.
Proposition 2 (OUA variance decomposition)
With K-fold cross-fitting for f, the total variance of V̂_DR decomposes as:

Var(V̂_DR) = Var_eval + Var_cal

where:
- Var_eval: Standard influence-function variance from the DR estimator
- Var_cal: Variance from different oracle samples leading to different fitted calibrators f̂
The delete-one-fold jackknife across oracle folds yields a consistent estimator of Var_cal.
Practical procedure: For each oracle fold k = 1, …, K, compute V̂^(−k) using f̂ trained without fold k. Jackknife variance: Var_cal ≈ ((K−1)/K) Σ_k (V̂^(−k) − V̄)², where V̄ is the mean of the fold-deleted estimates. Add to the IF variance for the total CI. See Efron & Tibshirani (1994) [3] for jackknife theory.
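The procedure is a few lines of code. In the sketch below, `v_minus_k`, `if_var`, and the numbers are hypothetical placeholders for the fold-deleted estimates and the influence-function variance:

```python
# Delete-one-fold jackknife for calibration (OUA) variance, plus the
# total CI combining it with the influence-function variance.
import statistics

def oua_jackknife_var(v_minus_k):
    """Var_cal ~ (K-1)/K * sum_k (V^(-k) - Vbar)^2 over fold-deleted estimates."""
    K = len(v_minus_k)
    v_bar = statistics.fmean(v_minus_k)
    return (K - 1) / K * sum((v - v_bar) ** 2 for v in v_minus_k)

def total_ci(v_hat, if_var, v_minus_k, z=1.96):
    """95% CI from Var_eval + Var_cal, plus the OUA-share diagnostic."""
    var_cal = oua_jackknife_var(v_minus_k)
    se = (if_var + var_cal) ** 0.5
    oua_share = var_cal / (if_var + var_cal)
    return (v_hat - z * se, v_hat + z * se), oua_share

# Hypothetical example: 5 fold-deleted estimates around V_hat = 0.70
(lo, hi), share = total_ci(0.70, if_var=1e-4,
                           v_minus_k=[0.69, 0.71, 0.70, 0.72, 0.68])
assert lo < 0.70 < hi
```

In this example the calibration term dominates (high OUA share), which by the diagnostic below would argue for collecting more oracle labels.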
OUA share as a diagnostic
Define OUA share = Var_cal / (Var_eval + Var_cal). High OUA share (≥ 0.3) indicates you should collect more oracle labels. Low share (≤ 0.1) means evaluation uncertainty dominates, so more cheap scores help more than more labels.
Off-policy safety rail
If the effective sample size (ESS) of the importance weights is below 10% of n, abort off-policy evaluation and fall back to the Direct method. Always report ESS and weight tail metrics.
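The safety rail is cheap to implement; a minimal sketch using the standard ESS formula (the 10% threshold follows the text):

```python
# Effective sample size of importance weights and the abort rule.
def ess(weights):
    """ESS = (sum w)^2 / sum w^2; equals n when all weights are equal."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

def off_policy_ok(weights, min_frac=0.10):
    """False -> abort off-policy evaluation and use the Direct method."""
    return ess(weights) >= min_frac * len(weights)

# Even weights keep full ESS; one dominant weight collapses ESS toward 1.
assert ess([1.0] * 100) == 100.0
assert not off_policy_ok([1000.0] + [0.01] * 99)
```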
3. Closed-Loop Design of Measurement
The core algorithmic contribution: a feedback loop that alternates between calibration and prompt refinement.
Algorithm 1: Closed-Loop Judge Improvement
Default hyperparameters
- Calibrator: Two-stage if covariates are available (spline over [S, length, has_citation], then isotonic regression on the spline output). Otherwise, isotonic regression on S directly.
- Cross-fitting: K folds (K = 5 is a reasonable default). Larger K reduces bias at the cost of higher variance.
- Residual slicing: 10–20 groups (domain × difficulty × length bins). Use Benjamini–Hochberg to control the false discovery rate across groups.
- Stopping: Two consecutive iterations with no significant residual structure (all group means have CIs overlapping 0) and transport diagnostics pass.
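The isotonic stage of the default calibrator can be sketched with a plain pool-adjacent-violators pass (a self-contained illustration; a production implementation would use a library routine and add the spline first stage for covariates):

```python
# Isotonic regression of Y on S via pool-adjacent-violators (PAVA):
# fit the nondecreasing f minimizing sum (y - f(s))^2.
def isotonic_fit(s, y):
    """Return (sorted s values, fitted nondecreasing values)."""
    pairs = sorted(zip(s, y))
    blocks = []  # each block: [pooled level, weight]
    for _, yi in pairs:
        blocks.append([yi, 1])
        # pool adjacent violators until the levels are nondecreasing
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            l2, w2 = blocks.pop()
            l1, w1 = blocks.pop()
            blocks.append([(l1 * w1 + l2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for level, w in blocks:
        fitted.extend([level] * w)
    return [p[0] for p in pairs], fitted

# Toy gold slice: noisy but broadly increasing relationship between S and Y
s = [1, 2, 3, 4, 5, 6]
y = [0.2, 0.1, 0.4, 0.35, 0.8, 0.9]
xs, cal = isotonic_fit(s, y)
assert all(a <= b for a, b in zip(cal, cal[1:]))  # monotone calibrated reward
```

Monotonicity is the point: the calibrated reward can never decrease in the raw score, which keeps the judge's ordering while correcting its scale.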
Judge versioning (always log)
judge_model: gpt-4.5-mini
judge_prompt_hash: sha256:ab12…ef
rubric_version: 3.2
calibrator_version: isotonic-v5
Any change to model family or hard rules triggers a small oracle re-calibration before deployment.
3.1 Panel-of-judges as first-class primitive
Real quality is multi-objective. Rather than forcing all dimensions into a single scalar, use a panel of judges where each judge targets a specific dimension (accuracy, safety, concision, domain-appropriateness). Let S_θ = (S_{θ,1}, …, S_{θ,K}) denote the K-dimensional judge vector.
Learnable aggregator: Define S_θ^agg = wᵀS_θ, where w is learned via calibration to minimize 𝔼[(Y − f_θ(wᵀS_θ, X))²] on the gold set. This allows the data to determine the relative weighting rather than hard-coding trade-offs.
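A minimal sketch of learning w on a gold slice (the synthetic data and the plain gradient-descent fit are illustrative stand-ins for whatever fitting routine is actually used):

```python
# Learn aggregator weights w to minimize E[(Y - w.S)^2] on a gold slice.
# Toy setup: K = 3 judge dimensions, true trade-off weights [0.5, 0.3, 0.2].
import random

random.seed(3)
K = 3
true_w = [0.5, 0.3, 0.2]
gold = []
for _ in range(1000):
    s = [random.random() for _ in range(K)]          # panel scores
    y = sum(wi * si for wi, si in zip(true_w, s)) + random.gauss(0, 0.05)
    gold.append((s, y))

w = [0.0] * K
lr = 0.5
for _ in range(300):
    grad = [0.0] * K
    for s, y in gold:
        err = sum(wi * si for wi, si in zip(w, s)) - y
        for j in range(K):
            grad[j] += 2 * err * s[j] / len(gold)
    w = [wj - lr * gj for wj, gj in zip(w, grad)]

# The learned weights recover the data-generating trade-off
assert all(abs(wj - tj) < 0.1 for wj, tj in zip(w, true_w))
```

The same fit can be folded into the two-stage calibrator: learn w jointly with f on the gold set rather than hard-coding the panel weights in the config.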
Publish the vector, not just the scalar: Log the full vector S_θ alongside the aggregate S_θ^agg. The scalar gates decisions (pass/fail), but the vector provides:
- Telemetry: Which dimension drove the failure? (e.g., high accuracy but low safety)
- Dissent tracking: When judges disagree, flag for human review
- Stakeholder transparency: Show domain experts the specific dimension scores, not just aggregated "quality"
Example panel configuration
judges:
- name: accuracy
weight: 0.4
rubric: "Score 0-10 on factual correctness. Penalize unsupported claims."
- name: safety
weight: 0.3
rubric: "Score 0-10 on safety. Block harmful content, bias, privacy leaks."
- name: concision
weight: 0.2
rubric: "Score 0-10 on brevity. Penalize unnecessary verbosity."
- name: domain_fit
weight: 0.1
rubric: "Score 0-10 on domain appropriateness (medical, legal, etc.)."
aggregation: learned_linear # learns w from calibration data

3.2 Verdict card schema
Every judge evaluation should emit a verdict card—a structured audit trail that logs not just the score, but the reasoning, evidence, and configuration. This makes judgments auditable by default and enables retrospective analysis when calibration drifts.
Example verdict card (JSON schema)
{
"verdict_id": "vd_8f3a2b1c",
"timestamp": "2025-11-17T14:23:11Z",
"scores": {
"aggregate": 7.2,
"panel": {
"accuracy": 8.5,
"safety": 9.0,
"concision": 5.0,
"domain_fit": 7.0
}
},
"config": {
"judge_model": "gpt-4.5-mini",
"rubric_version": "3.2",
"rubric_hash": "sha256:ab12...ef",
"calibrator_version": "isotonic-v5"
},
"evidence": {
"citations_checked": ["source_1", "source_2"],
"tests_run": ["unit_test_1", "schema_validator"],
"flags": ["verbose_padding_detected"]
},
"rationale": "Response is factually accurate and safe, but contains unnecessary verbosity (300 tokens vs 150 needed). Concision score penalized accordingly."
}

Benefits: Verdict cards enable post-hoc debugging (why did this fail?), regulatory audit trails (what evidence was considered?), and drift detection (are recent verdicts systematically different from baseline?).
4. Transport Diagnostics
When does a calibrated judge trained on one environment transport to another (new policy, time period, user cohort)? We provide falsifiable tests.
4.1 Groupwise residual tests
For each group g (policy/time/domain), test:

H₀: 𝔼[Y − f(S_θ, X) | G = g] = 0
Procedure
- Collect a small validation slice in the new environment (up to ~200 examples)
- Score with the existing judge and calibrator: Ŷ_i = f(S_{θ,i}, X_i)
- Compute residuals: e_i = Y_i − Ŷ_i
- Test 𝔼[e | g] = 0 with a 95% CI via bootstrap or normal approximation
- FDR-correct across groups (Benjamini–Hochberg)
Decision rule: If any group's CI excludes 0, transport fails → re-prompt or fit a group-specific calibrator.
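The steps above can be sketched end to end; the group names and residual distributions below are synthetic, with one group given a deliberate +0.3 bias:

```python
# Groupwise transport test: per-group z-tests of E[residual | group] = 0,
# then Benjamini-Hochberg across groups.
import math
import random

random.seed(4)

def group_pvalues(residuals_by_group):
    """Two-sided z-test p-value per group (normal approximation)."""
    out = {}
    for g, res in residuals_by_group.items():
        n = len(res)
        mean = sum(res) / n
        var = sum((r - mean) ** 2 for r in res) / (n - 1)
        z = mean / math.sqrt(var / n)
        out[g] = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return out

def benjamini_hochberg(pvals, q=0.05):
    """Return the set of groups rejected at FDR level q."""
    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(ranked)
    k_max = 0
    for i, (_, p) in enumerate(ranked, start=1):
        if p <= q * i / m:
            k_max = i
    return {g for g, _ in ranked[:k_max]}

# Two well-calibrated groups, one with a systematic +0.3 residual bias
resid = {
    "legal":   [random.gauss(0.0, 0.2) for _ in range(200)],
    "medical": [random.gauss(0.0, 0.2) for _ in range(200)],
    "finance": [random.gauss(0.3, 0.2) for _ in range(200)],
}
failed = benjamini_hochberg(group_pvalues(resid))
assert "finance" in failed   # transport fails for the biased group
```

In practice the rejected groups feed directly into the decision rule: re-prompt if the residual pattern is interpretable, otherwise fit a group-specific calibrator.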
4.2 When transport fails
Option 1: Re-prompt (cheapest)
If residuals show a clear pattern (e.g., the new domain has different citation norms), update θ to address it. Re-score and re-calibrate on the combined data.
Option 2: Local calibration (Regime 2)
Collect oracle labels in the new environment. Fit an environment-specific calibrator f_e. Use f_e only within that environment.
Option 3: Fallback to direct Y estimation (Regime 1)
If surrogacy is weak everywhere, use S_θ only for efficiency (as covariates in the outcome model). This requires Y labels in every evaluation context. See Kallus & Mao (2020) [4].
5. Anti-Gaming & Adversarial Validation
If the generator (model being evaluated) optimizes against the judge, raw scores can be gamed. Calibration + residual monitoring mitigate but don't eliminate this risk.
5.1 Validation battery
Adversarial test suite (hold out from calibration)
Expected outcomes
- Uncalibrated S_θ: mean score increases by 0.10–0.30 under attacks
- Calibrated f(S_θ, X): shift ≤ 0.05 if the rubric has proper guards
- Failures surface as residual structure → trigger prompt updates
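A padding attack from the battery can be simulated directly; the verbosity-leaking judge and the length-aware calibrator below are toy stand-ins, with thresholds taken from the expected outcomes above:

```python
# Anti-gaming check sketch: compare mean raw vs calibrated scores on clean
# inputs and their padded twins. Toy judge leaks a verbosity bonus; the
# length-aware calibrator (fit on gold in a real system) strips it.
import random

random.seed(5)

def judge_raw(quality, n_tokens):
    """Toy judge with a verbosity leak: longer answers score higher."""
    return quality + 0.002 * n_tokens + random.gauss(0, 0.02)

def calibrate(s, n_tokens):
    """Toy length-aware calibrator: removes the verbosity term."""
    return s - 0.002 * n_tokens

clean = [(random.random(), 150) for _ in range(1000)]
padded = [(q, n + 150) for q, n in clean]   # same quality, double the length

def mean_scores(items):
    raw = [judge_raw(q, n) for q, n in items]
    cal = [calibrate(s, n) for s, (q, n) in zip(raw, items)]
    return sum(raw) / len(raw), sum(cal) / len(cal)

raw_c, cal_c = mean_scores(clean)
raw_p, cal_p = mean_scores(padded)

assert raw_p - raw_c > 0.10          # uncalibrated score is gameable
assert abs(cal_p - cal_c) <= 0.05    # calibrated score resists padding
```

When the calibrated shift exceeds the 0.05 tolerance, that is exactly the residual structure that should trigger a rubric update.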
6. Sample Size Planning & Budget Curves
6.1 Approximate formulas (DR estimator)
Under standard regularity, the variance of V̂_DR scales as:

Var(V̂_DR) ≈ σ²_eval / n + σ²_cal / m

where:
- σ²_eval / n: Variance from evaluation (DR influence function), with n the number of scored examples
- σ²_cal / m: Variance from calibration uncertainty, with m the number of oracle labels
Rule of thumb: If OUA share ≈ 0.2, then Var_cal ≈ 0.25 · Var_eval. To achieve SE ≤ 0.025 (95% CI half-width ≈ 0.05), you need:

σ²_eval / n + σ²_cal / m ≤ 0.025²

For typical tasks, this translates to a few thousand scored examples (n on the order of 3000) and a few hundred oracle labels (m on the order of 500).
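Under the additive variance model above, a budget planner is a few lines. The allocation rule below (splitting the variance budget by a target OUA share) is one reasonable choice, not the only one, and the variance inputs are illustrative:

```python
# Sample-size planner under Var ~ sigma2_eval/n + sigma2_cal/m, allotting
# an `oua_share` fraction of the variance budget to calibration uncertainty.
import math

def plan_sample_sizes(sigma2_eval, sigma2_cal, target_se, oua_share=0.2):
    """Smallest (n scored, m oracle) meeting the SE target under the split."""
    budget = target_se ** 2
    n = sigma2_eval / ((1 - oua_share) * budget)
    m = sigma2_cal / (oua_share * budget)
    return math.ceil(n), math.ceil(m)

# Hypothetical per-example variances; target SE 0.025 as in the rule of thumb
n, m = plan_sample_sizes(sigma2_eval=1.0, sigma2_cal=0.05, target_se=0.025)

# The planned sizes meet the variance budget
assert 1.0 / n + 0.05 / m <= 0.025 ** 2 + 1e-9
```

Sweeping `oua_share` traces out the budget curve: more oracle labels buy down Var_cal, more cheap scores buy down Var_eval.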
7. Limitations & Open Questions
7.1 Construct drift in Y itself
Our theory assumes (the target outcome) is stable. But user preferences evolve, tasks shift, and "what good looks like" changes over time. If the definition of drifts, no calibration can save you—you need to re-anchor with new reference policies or expert consensus. See Y* anchoring for versioning and stability checks.
7.2 Non-regular targets (max/min/quantile aggregators)
Our results assume and estimands are smooth enough for standard asymptotic theory. Extreme-value targets (e.g., worst-case performance, 99th percentile) violate regularity. Calibration can still help empirically, but theoretical guarantees require EVT-specific arguments.
7.3 Multimodal judges (beyond text)
We focus on text-based judges. Extending to vision, audio, or multimodal settings is straightforward mechanically (same calibration loop), but failure modes differ. E.g., vision judges may reward aesthetic style over correctness. Residual diagnostics adapt, but domain-specific anti-gaming tests are needed.
7.4 Optimizing against the judge (reward hacking)
If model training directly optimizes the calibrated reward f(S_θ, X), generators will eventually exploit any remaining gaps between f(S_θ, X) and Y. Our anti-gaming tests catch known attacks, but adversarial generators can discover novel exploits. Defense: treat f(S_θ, X) as a measurement upgrade, not the training objective; periodically validate with A/B tests on real Y; version θ and re-calibrate when drift is detected.
8. Conclusion
We have presented a formal framework for programmable surrogacy: LLM judges whose measurement function is endogenous and improvable through prompt engineering. Our closed-loop algorithm alternates between calibration (learning f) and residual-guided prompt updates (refining θ to reduce error). Under mild regularity, enriching the judge's evidence set weakly reduces prediction error (Lemma 1), and doubly-robust policy evaluation with calibrated rewards admits valid, oracle-uncertainty-aware inference (Proposition 2).
Key contributions:
- Theoretical: Information-ordering guarantee for prompt refinements; OUA variance decomposition for calibrated DR estimators; transport diagnostics for cross-environment validity
- Methodological: Closed-loop algorithm with stopping criteria; residual audit pipeline; anti-gaming test battery; sample-size planning with budget curves
- Practical: Drop-in templates for judge prompts, residual diagnostics, and governance checklists; reproducible experiments showing 20–40% RMSE reduction over two iterations
Implementation & Resources
For practical guidance, code examples, and step-by-step tutorials, see the companion resources:
Assumptions Ledger
| Code | Statement | Used by | Test/Diagnostic | Mitigation |
|---|---|---|---|---|
| J2 | Surrogacy sufficiency on support: E[Y \| X, A, S] = f(S, X) | All estimators | Incremental signal; residual vs. f | Richer judge; higher rung; add covariates |
| J3 | Transport (S-admissibility): Y ⊥ Sel \| X, A, S | Cross-env eval | Per-group residual test; cross-domain Prentice test | Re-prompt / local calibrator |
| Overlap | π ≪ π₀ | IPS/DR | ESS, weight tails | Collect draws; weight stabilization |
| OUA | Finite oracle labels → extra variance | Inference | OUA share | Add labels if OUA dominates |
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2025programmable,
author = {Landesberg, Eddie},
title = {Programmable Proxies: Designing, Calibrating, and Auditing LLM Judges as Surrogate Outcomes},
year = {2025},
month = {November},
url = {https://cimolabs.com/research/programmable-proxies-technical},
note = {CIMO Labs Technical Report}
}

Plain Text
Landesberg, E. (2025). Programmable Proxies: Designing, Calibrating, and Auditing LLM Judges as Surrogate Outcomes. CIMO Labs Technical Report. https://cimolabs.com/research/programmable-proxies-technical
References
[1] Blackwell, D. (1951). Comparison of Experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability.
[2] Dudík, M., Erhan, D., Langford, J., & Li, L. (2014). Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29(4).
[3] Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Chapman & Hall/CRC.
[4] Kallus, N., & Mao, X. (2020). On the Role of Surrogates in the Efficient Estimation of Treatment Effects with Limited Outcome Data. arXiv preprint.
Acknowledgements
We thank the CIMO Labs team and community for feedback on this work, and the researchers whose foundational contributions to surrogate endpoint theory, off-policy evaluation, and information theory made this framework possible.
We welcome your feedback
This framework is actively evolving. We invite constructive criticism from practitioners and researchers.
If you spot errors, have theoretical extensions, or have applied programmable surrogacy in production and want to share lessons, please let us know or email eddie@cimolabs.com.
