Programmable Proxies: Designing, Calibrating, and Auditing LLM Judges as Surrogate Outcomes
Technical Appendix
Abstract. We model LLM judges as programmable surrogates parameterized by a prompt/rubric θ. We propose a closed-loop procedure that calibrates judge scores to a small gold slice, audits residuals to detect systematic misscoring, and updates the prompt to reduce error. Under mild conditions, enriching the rubric weakly reduces Bayes risk (via Blackwell/Doob information ordering), and doubly-robust policy evaluation with calibrated rewards admits valid, oracle-uncertainty-aware inference. We provide transport diagnostics, anti-gaming tests, and practical sample-size guidance. Experiments across text-assistant tasks show that two prompt updates reduce calibration RMSE by 20–40% (two assistant tasks, n≈10k scored / 1.5k gold each; 95% CI: 18–43%) and preserve validity under adversarial padding, while legacy business proxies remain brittle. Unlike clicks or engagement metrics, LLM judges are repairable instruments—when they drift, you update the rubric, recalibrate, and verify with residuals.
0. Notation & Setup
Primitives
- X, A: Context (prompt) and action (response/policy)
- Y: Target outcome (expert audit, IDO realization, or user satisfaction)
- θ: Judge prompt/rubric specification
- J_θ: Judge function parameterized by θ
- S_θ = J_θ(X, A): Raw judge score (may include sub-rubrics)
- f: Calibrator function
- R = f(S_θ, X): Calibrated reward
Estimand
Policy value under target outcome Y:

V(π) = 𝔼_X 𝔼_{A∼π(·|X)}[Y]
Data
- Logging data: {(X_i, A_i, S_{θ,i})} from behavior policy π₀
- Oracle slice: a subsample with labeled Y values
- Oracle coverage: the fraction of logged examples with oracle labels (typically 10–25%)
1. Endogenous Surrogate Theory
Key insight: Unlike legacy business proxies (clicks, sign-ups) which are exogenous signals, LLM judges are endogenous—their measurement channel is programmable via the rubric θ.
Assumption J1 (Programmable measurement channel)
The judge is parameterized by a prompt/rubric θ that specifies: (i) which evidence to consider, (ii) what weight to assign different quality dimensions, and (iii) what violations trigger penalties. The evidence set E_θ induces a σ-field σ(E_θ).
Lemma 1 (Informativeness monotonicity)
Let σ(E_θ) denote the σ-field induced by the evidence the judge must consider under rubric θ. If σ(E_θ) ⊆ σ(E_θ′) (the refined rubric requires more evidence), then:

𝔼[(Y − 𝔼[Y | σ(E_θ′)])²] ≤ 𝔼[(Y − 𝔼[Y | σ(E_θ)])²]

That is, enriching the rubric weakly reduces Bayes risk for predicting Y.
Argument sketch: Direct from the L²-projection (tower) property of conditional expectation. For any refinement of σ-fields F ⊆ G, the conditional expectation 𝔼[Y | G] is the best L² predictor given the larger information set, hence weakly better than 𝔼[Y | F]. See Blackwell (1951) [1] for the information ordering result. □
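As a numeric sanity check of Lemma 1, the toy simulation below (the data-generating model and all names are illustrative, not from the paper) compares the Bayes risk of a judge conditioning on one evidence bit versus two:

```python
# Toy check of Lemma 1: refining the evidence set the judge conditions on
# weakly lowers L2 risk. Hypothetical model: Y depends on two binary
# evidence bits; a coarse rubric sees only bit 1, a refined rubric sees both.
import random

random.seed(0)

data = [(e1, e2, 2.0 * e1 + 1.0 * e2 + random.gauss(0, 0.1))
        for _ in range(20000)
        for e1 in [random.randint(0, 1)]
        for e2 in [random.randint(0, 1)]]

def cond_mean(key):
    """Empirical E[Y | key(evidence)] by per-cell averaging."""
    sums, counts = {}, {}
    for e1, e2, y in data:
        k = key(e1, e2)
        sums[k] = sums.get(k, 0.0) + y
        counts[k] = counts.get(k, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def mse(key):
    mu = cond_mean(key)
    return sum((y - mu[key(e1, e2)]) ** 2 for e1, e2, y in data) / len(data)

mse_coarse = mse(lambda e1, e2: e1)        # sigma(E_theta): bit 1 only
mse_fine = mse(lambda e1, e2: (e1, e2))    # sigma(E_theta'): both bits

assert mse_fine <= mse_coarse  # enrichment weakly reduces Bayes risk
```

Here the coarse judge's risk includes the full variance of the second evidence bit, so the gap is large; in general the lemma only guarantees a weak inequality.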
Practical implication: When residual analysis reveals that the judge is missing signal (e.g., rewarding verbosity, missing hallucinations), we can update θ to require explicit evidence checks, then verify that calibration error decreases.
Assumption J1b (Complexity gap)
Evaluating a candidate answer against a rubric is a simpler task than generating the answer from scratch. Formally, let C_gen(X) denote the computational complexity of producing action A from context X, and C_judge(X, A, θ) the complexity of scoring (X, A) against rubric θ. We assume:

C_judge(X, A, θ) ≪ C_gen(X)
This asymmetry enables using smaller, cheaper models for judging than for generation, and spending heavy compute only where the judge's uncertainty is high (e.g., via abstain policies or ensembles).
Assumption J2 (Surrogacy sufficiency on support)
For (x, a) on the relevant support, there exists f such that:

𝔼[Y | X, A, S_θ] = f(S_θ, X)

This is the local surrogacy condition (Regime 2 in the surrogacy technical appendix). It states that S_θ (together with X) captures all the information about Y in (X, A), on the relevant support.
Assumption J3 (Transport/S-admissibility)
The calibrator f transports across evaluation environments (policies, time periods, user cohorts) when:

Y ⊥ Sel | X, A, S_θ

where Sel represents selection nodes (policy choice, time period, judge pool). This is the S-admissibility condition from surrogate theory. See §4 for transport diagnostics.
1.5 Why Calibration Is Optimal: The Pathwise Derivative
We now show why calibration f(S) = 𝔼[Y|S] is not merely useful—it is variance-optimal for estimating policy value. This section connects calibration to semiparametric efficiency theory, showing that f(S) is the Efficient Influence Function (EIF) for the welfare functional.
The Setup: Welfare as a Functional
Our target estimand is the expected welfare under policy πθ:

ψ(πθ) = 𝔼_{X, A∼πθ}[Y]
We want to understand how ψ changes as we vary the policy parameters θ. Specifically, we seek the pathwise derivative—the gradient of ψ with respect to θ along any smooth path in the policy space.
The Decomposition
Any welfare outcome Y can be decomposed into two orthogonal components relative to the surrogate score S:

Y = f(S) + ε,  where f(S) = 𝔼[Y|S] and ε := Y − f(S)
This decomposition has three critical properties:
Lemma 2 (Orthogonal Decomposition Properties)
- Uniqueness: The decomposition Y = f(S) + ε with f(S) = 𝔼[Y|S] and 𝔼[ε|S] = 0 is unique (follows from the definition of conditional expectation as the L²-projection).
- Orthogonality: For any measurable function g(S), we have 𝔼[ε · g(S)] = 0. The noise ε is orthogonal to the entire space of functions of S.
- Variance Decomposition: Var(Y) = Var(f(S)) + Var(ε), with no covariance term because 𝔼[f(S) · ε] = 0.
Argument sketch: (1) follows from the projection theorem in Hilbert space. (2) follows by iterated expectation: 𝔼[ε · g(S)] = 𝔼[𝔼[ε|S] · g(S)] = 0. (3) follows from (2) by taking g(S) = f(S). □
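The three properties can be verified numerically. The sketch below (toy model, illustrative only) estimates f(S) = 𝔼[Y|S] by per-level averaging and checks orthogonality and the variance decomposition:

```python
# Numeric check of Lemma 2 on a toy discrete model: with f(S) = E[Y|S],
# the residual eps is orthogonal to any g(S) and Var(Y) = Var(f(S)) + Var(eps).
import random
import statistics

random.seed(1)
S = [random.randint(0, 3) for _ in range(50000)]
Y = [s + random.gauss(0, 1) for s in S]          # Y = S + noise

# f(S) = E[Y|S], estimated by per-level averaging
levels = sorted(set(S))
f = {s: statistics.fmean(y for s2, y in zip(S, Y) if s2 == s) for s in levels}
fS = [f[s] for s in S]
eps = [y - fs for y, fs in zip(Y, fS)]

def var(xs):
    m = statistics.fmean(xs)
    return statistics.fmean((x - m) ** 2 for x in xs)

# (2) Orthogonality: E[eps * g(S)] ~ 0 for an arbitrary g(s) = s**2
ortho = statistics.fmean(e * s ** 2 for e, s in zip(eps, S))
# (3) Variance decomposition: Var(Y) - (Var(f(S)) + Var(eps)) ~ 0
gap = var(Y) - (var(fS) + var(eps))

assert abs(ortho) < 0.1 and abs(gap) < 0.05
```

Because f is the empirical per-level mean, the residuals sum to zero within each level, so both checks hold essentially exactly, not just asymptotically.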
The Pathwise Derivative
Consider a smooth path through policy space: πθ where θ ∈ ℝ. The pathwise derivative of the welfare functional is:

dψ/dθ = d/dθ 𝔼_{πθ}[Y]

Substituting the decomposition Y = f(S) + ε:

dψ/dθ = d/dθ 𝔼_{πθ}[f(S)] + d/dθ 𝔼_{πθ}[ε]

The second term vanishes. Because ε is orthogonal to S (which is a function of πθ), any change in θ that affects S does not affect the expected value of ε:

d/dθ 𝔼_{πθ}[ε] = 0

Therefore, the pathwise derivative simplifies to:

dψ/dθ = d/dθ 𝔼_{πθ}[f(S)]
Result 1 (Calibration as the EIF)
The calibration function f(S) = 𝔼[Y|S] is the Efficient Influence Function for the welfare functional ψ(π) = 𝔼[Y]. Any other estimator of ψ using the surrogate S has variance ≥ Var(f(S)) or is asymptotically equivalent.
Argument sketch: By the semiparametric efficiency bound (Bickel et al., 1993), the EIF for a functional is the unique element in the tangent space that satisfies the orthogonality condition. We have shown that f(S) satisfies this condition: the "nuisance" component ε is orthogonal to all score operators in the tangent space generated by S. Therefore f(S) is the Riesz representer for ψ, which is the EIF. See van der Laan & Robins (2003) for the general framework. □
Geometric Interpretation
What does this mean geometrically? The space of all possible outcomes Y forms a Hilbert space. The surrogate S generates a tangent space—the subspace of outcomes that can be predicted from S.
- f(S) = 𝔼[Y|S] is the orthogonal projection of Y onto this tangent space.
- ε = Y - f(S) is the residual—the component of Y orthogonal to the tangent space.
- When we optimize, ∇S (the raw gradient) points in a direction that includes both the tangent space and orthogonal directions (exploitation).
- ∇f(S) (the calibrated gradient) points only in directions within the tangent space—directions that actually move toward higher expected welfare.
This is why calibration removes reward hacking: hacking vectors live in the orthogonal complement (the ε space), and calibration projects them out.
Connection to Regime 4
This derivation applies to Regime 3 (Calibration) in the surrogacy taxonomy. In Regime 4 (Optimization), we additionally require that the policy optimization process remains on the tangent space. This is where Standard Deliberation Protocols (SDPs) enter: they constrain the action space to ensure that optimization steps cannot exploit side channels outside the tangent space defined by f(S).
See Y*-Aligned Systems (Technical) for the full treatment of Regime 4 and the role of SDPs in maintaining alignment during optimization.
Practical Implications
Why This Guarantees Optimality
- No estimator can do better: The EIF achieves the semiparametric efficiency bound. Any other estimator using S has asymptotic variance ≥ Var(f(S)).
- Calibration is necessary: Using raw scores S is suboptimal because ∇S includes directions orthogonal to welfare (the ε component).
- The method is unique: f(S) = 𝔼[Y|S] is the unique L²-optimal predictor. Any other calibration function g(S) ≠ 𝔼[Y|S] has higher mean squared error.
This is the theoretical foundation for CJE. When we calibrate judges in practice (Section 3), we are approximating the EIF using a finite gold sample. The doubly-robust estimator (Section 2) combines this EIF approximation with importance weighting to achieve valid inference even under model misspecification.
For geometric intuition, see The Geometry of Goodhart's Law for the manifold interpretation of this result.
2. Policy Evaluation with Calibrated Rewards
2.1 Doubly Robust Estimator
Given calibrated reward R_i = f(S_{θ,i}, X_i), the doubly robust estimator is:

V̂_DR(π) = (1/n) Σ_i [ ĝ(X_i; π) + W_i (R_i − q̂(X_i, A_i)) ]

where:
- W_i = π(A_i | X_i) / π₀(A_i | X_i): importance weight
- q̂(x, a): outcome regression model
- ĝ(x; π) = 𝔼_{A∼π(·|x)}[q̂(x, A)]: policy value function
Proposition 1 (DR identification with calibrated rewards)
Under Assumption J2 (surrogacy sufficiency) and standard overlap (π₀(a|x) > 0 whenever π(a|x) > 0), the doubly robust estimator with R = f(S_θ, X) satisfies:

V̂_DR(π) →p V(π)

as n → ∞, provided either the outcome model or the propensity model is consistent.
Proof: Standard DR identification argument. See Dudík et al. (2014) [2] for general DR theory. The key is that R = f(S_θ, X) plays the role of Y in the standard setup, and Assumption J2 ensures 𝔼[R | X, A] = 𝔼[Y | X, A] on support. □
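A minimal sketch of the estimator on a toy logging setup (the policies, the reward model, and the deliberately misspecified outcome regression `q_hat` are all hypothetical, chosen so the true value is known in closed form):

```python
# DR estimator sketch: V_DR = mean_i [ g_hat(X_i; pi) + W_i * (R_i - q_hat(X_i, A_i)) ].
# Toy setup (illustrative): uniform behavior policy, target policy preferring
# action 1, calibrated reward R with mean 0.5*X + 0.5*A.
import random

random.seed(2)
actions = [0, 1]

def pi0(a, x):           # behavior policy: uniform over two actions
    return 0.5

def pi(a, x):            # target policy: prefers action 1
    return 0.75 if a == 1 else 0.25

def q_hat(x, a):         # outcome-regression model (deliberately misspecified)
    return 0.5 * x + 0.4 * a

def true_reward(x, a):   # mean of the calibrated reward in this toy world
    return 0.5 * x + 0.5 * a

logs = []
for _ in range(20000):
    x = random.random()
    a = random.choices(actions, weights=[pi0(b, x) for b in actions])[0]
    r = true_reward(x, a) + random.gauss(0, 0.1)
    logs.append((x, a, r))

def g_hat(x):            # policy value function under q_hat
    return sum(pi(a, x) * q_hat(x, a) for a in actions)

v_dr = sum(
    g_hat(x) + (pi(a, x) / pi0(a, x)) * (r - q_hat(x, a))
    for x, a, r in logs
) / len(logs)

v_true = 0.5 * 0.5 + 0.5 * 0.75   # E[0.5 X] + 0.5 * P_pi(A = 1) = 0.625
assert abs(v_dr - v_true) < 0.02  # correct despite the misspecified q_hat
```

The importance-weighted correction term repairs the bias of the misspecified `q_hat`, which is exactly the double-robustness property Proposition 1 invokes.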
2.2 Oracle-Uncertainty-Aware (OUA) Variance
Standard DR inference accounts for evaluation uncertainty (sampling from the population). But there's a second source: calibration uncertainty from learning on a finite oracle slice.
Proposition 2 (OUA variance decomposition)
With K-fold cross-fitting for f, the total variance of V̂_DR decomposes as:

Var(V̂_DR) = Var_eval + Var_cal

where:
- Var_eval: Standard influence-function variance from the DR estimator
- Var_cal: Variance from different oracle samples leading to different fitted calibrators f̂
The delete-one-fold jackknife across oracle folds yields a consistent estimator of Var_cal.
Practical procedure: For each oracle fold k = 1, …, K, compute V̂^(−k) using f̂ trained without fold k. Jackknife variance: Var_cal ≈ ((K−1)/K) Σ_k (V̂^(−k) − V̄)², where V̄ is the mean of the fold-deleted estimates. Add to the IF variance for the total CI. See Efron & Tibshirani (1994) [3] for jackknife theory.
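The procedure is a few lines of code. In the sketch below, `v_minus_k`, `if_var`, and the numbers are hypothetical placeholders for the fold-deleted estimates and the influence-function variance:

```python
# Delete-one-fold jackknife for calibration (OUA) variance, plus the
# total CI combining it with the influence-function variance.
import statistics

def oua_jackknife_var(v_minus_k):
    """Var_cal ~ (K-1)/K * sum_k (V^(-k) - Vbar)^2 over fold-deleted estimates."""
    K = len(v_minus_k)
    v_bar = statistics.fmean(v_minus_k)
    return (K - 1) / K * sum((v - v_bar) ** 2 for v in v_minus_k)

def total_ci(v_hat, if_var, v_minus_k, z=1.96):
    """95% CI from Var_eval + Var_cal, plus the OUA-share diagnostic."""
    var_cal = oua_jackknife_var(v_minus_k)
    se = (if_var + var_cal) ** 0.5
    oua_share = var_cal / (if_var + var_cal)
    return (v_hat - z * se, v_hat + z * se), oua_share

# Hypothetical example: 5 fold-deleted estimates around V_hat = 0.70
(lo, hi), share = total_ci(0.70, if_var=1e-4,
                           v_minus_k=[0.69, 0.71, 0.70, 0.72, 0.68])
assert lo < 0.70 < hi
```

In this example the calibration term dominates (high OUA share), which by the diagnostic below would argue for collecting more oracle labels.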
OUA share as a diagnostic
Define OUA share = Var_cal / (Var_eval + Var_cal). High OUA share (≥ 0.3) indicates you should collect more oracle labels. Low share (≤ 0.1) means evaluation uncertainty dominates, so more cheap scores help more than more labels.
Off-policy safety rail
If the effective sample size (ESS) of the importance weights is below 10% of n, abort off-policy evaluation and fall back to the Direct method. Always report ESS and weight tail metrics.
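The safety rail is cheap to implement; a minimal sketch using the standard ESS formula (the 10% threshold follows the text):

```python
# Effective sample size of importance weights and the abort rule.
def ess(weights):
    """ESS = (sum w)^2 / sum w^2; equals n when all weights are equal."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

def off_policy_ok(weights, min_frac=0.10):
    """False -> abort off-policy evaluation and use the Direct method."""
    return ess(weights) >= min_frac * len(weights)

# Even weights keep full ESS; one dominant weight collapses ESS toward 1.
assert ess([1.0] * 100) == 100.0
assert not off_policy_ok([1000.0] + [0.01] * 99)
```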
3. Closed-Loop Design of Measurement
The core algorithmic contribution: a feedback loop that alternates between calibration and prompt refinement.
Algorithm 1: Closed-Loop Judge Improvement
Default hyperparameters
- Calibrator: Two-stage if covariates are available (spline over [S, length, has_citation], then isotonic regression on the spline output). Otherwise, isotonic regression on S directly.
- Cross-fitting: K folds (K = 5 is a reasonable default). Larger K reduces bias at the cost of higher variance.
- Residual slicing: 10–20 groups (domain × difficulty × length bins). Use Benjamini–Hochberg to control the false discovery rate across groups.
- Stopping: Two consecutive iterations with no significant residual structure (all group means have CIs overlapping 0) and transport diagnostics pass.
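The isotonic stage of the default calibrator can be sketched with a plain pool-adjacent-violators pass (a self-contained illustration; a production implementation would use a library routine and add the spline first stage for covariates):

```python
# Isotonic regression of Y on S via pool-adjacent-violators (PAVA):
# fit the nondecreasing f minimizing sum (y - f(s))^2.
def isotonic_fit(s, y):
    """Return (sorted s values, fitted nondecreasing values)."""
    pairs = sorted(zip(s, y))
    blocks = []  # each block: [pooled level, weight]
    for _, yi in pairs:
        blocks.append([yi, 1])
        # pool adjacent violators until the levels are nondecreasing
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            l2, w2 = blocks.pop()
            l1, w1 = blocks.pop()
            blocks.append([(l1 * w1 + l2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for level, w in blocks:
        fitted.extend([level] * w)
    return [p[0] for p in pairs], fitted

# Toy gold slice: noisy but broadly increasing relationship between S and Y
s = [1, 2, 3, 4, 5, 6]
y = [0.2, 0.1, 0.4, 0.35, 0.8, 0.9]
xs, cal = isotonic_fit(s, y)
assert all(a <= b for a, b in zip(cal, cal[1:]))  # monotone calibrated reward
```

Monotonicity is the point: the calibrated reward can never decrease in the raw score, which keeps the judge's ordering while correcting its scale.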
Judge versioning (always log)
judge_model: gpt-4.5-mini
judge_prompt_hash: sha256:ab12…ef
rubric_version: 3.2
calibrator_version: isotonic-v5
Any change to model family or hard rules triggers a small oracle re-calibration before deployment.
3.1 Panel-of-judges as first-class primitive
Real quality is multi-objective. Rather than forcing all dimensions into a single scalar, use a panel of judges where each judge targets a specific dimension (accuracy, safety, concision, domain-appropriateness). Let S_θ = (S_{θ,1}, …, S_{θ,K}) denote the K-dimensional judge vector.
Learnable aggregator: Define S_θ^agg = wᵀS_θ, where w is learned via calibration to minimize 𝔼[(Y − f_θ(wᵀS_θ, X))²] on the gold set. This allows the data to determine the relative weighting rather than hard-coding trade-offs.
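A minimal sketch of learning w on a gold slice (the synthetic data and the plain gradient-descent fit are illustrative stand-ins for whatever fitting routine is actually used):

```python
# Learn aggregator weights w to minimize E[(Y - w.S)^2] on a gold slice.
# Toy setup: K = 3 judge dimensions, true trade-off weights [0.5, 0.3, 0.2].
import random

random.seed(3)
K = 3
true_w = [0.5, 0.3, 0.2]
gold = []
for _ in range(1000):
    s = [random.random() for _ in range(K)]          # panel scores
    y = sum(wi * si for wi, si in zip(true_w, s)) + random.gauss(0, 0.05)
    gold.append((s, y))

w = [0.0] * K
lr = 0.5
for _ in range(300):
    grad = [0.0] * K
    for s, y in gold:
        err = sum(wi * si for wi, si in zip(w, s)) - y
        for j in range(K):
            grad[j] += 2 * err * s[j] / len(gold)
    w = [wj - lr * gj for wj, gj in zip(w, grad)]

# The learned weights recover the data-generating trade-off
assert all(abs(wj - tj) < 0.1 for wj, tj in zip(w, true_w))
```

The same fit can be folded into the two-stage calibrator: learn w jointly with f on the gold set rather than hard-coding the panel weights in the config.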
Publish the vector, not just the scalar: Log the full vector S_θ alongside the aggregate S_θ^agg. The scalar gates decisions (pass/fail), but the vector provides:
- Telemetry: Which dimension drove the failure? (e.g., high accuracy but low safety)
- Dissent tracking: When judges disagree, flag for human review
- Stakeholder transparency: Show domain experts the specific dimension scores, not just aggregated "quality"
Example panel configuration
judges:
- name: accuracy
weight: 0.4
rubric: "Score 0-10 on factual correctness. Penalize unsupported claims."
- name: safety
weight: 0.3
rubric: "Score 0-10 on safety. Block harmful content, bias, privacy leaks."
- name: concision
weight: 0.2
rubric: "Score 0-10 on brevity. Penalize unnecessary verbosity."
- name: domain_fit
weight: 0.1
rubric: "Score 0-10 on domain appropriateness (medical, legal, etc.)."
aggregation: learned_linear # learns w from calibration data

3.2 Verdict card schema
Every judge evaluation should emit a verdict card—a structured audit trail that logs not just the score, but the reasoning, evidence, and configuration. This makes judgments auditable by default and enables retrospective analysis when calibration drifts.
Example verdict card (JSON schema)
{
"verdict_id": "vd_8f3a2b1c",
"timestamp": "2025-11-17T14:23:11Z",
"scores": {
"aggregate": 7.2,
"panel": {
"accuracy": 8.5,
"safety": 9.0,
"concision": 5.0,
"domain_fit": 7.0
}
},
"config": {
"judge_model": "gpt-4.5-mini",
"rubric_version": "3.2",
"rubric_hash": "sha256:ab12...ef",
"calibrator_version": "isotonic-v5"
},
"evidence": {
"citations_checked": ["source_1", "source_2"],
"tests_run": ["unit_test_1", "schema_validator"],
"flags": ["verbose_padding_detected"]
},
"rationale": "Response is factually accurate and safe, but contains unnecessary verbosity (300 tokens vs 150 needed). Concision score penalized accordingly."
}

Benefits: Verdict cards enable post-hoc debugging (why did this fail?), regulatory audit trails (what evidence was considered?), and drift detection (are recent verdicts systematically different from baseline?).
4. Transport Diagnostics
When does a calibrated judge trained on one environment transport to another (new policy, time period, user cohort)? We provide falsifiable tests.
4.1 Groupwise residual tests
For each group g (policy/time/domain), test:

H₀: 𝔼[Y − f(S_θ, X) | G = g] = 0
Procedure
- Collect a small validation slice in the new environment (up to ~200 examples)
- Score with the existing judge and calibrator: Ŷ_i = f(S_{θ,i}, X_i)
- Compute residuals: e_i = Y_i − Ŷ_i
- Test 𝔼[e | g] = 0 with a 95% CI via bootstrap or normal approximation
- FDR-correct across groups (Benjamini–Hochberg)
Decision rule: If any group's CI excludes 0, transport fails → re-prompt or fit a group-specific calibrator.
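The steps above can be sketched end to end; the group names and residual distributions below are synthetic, with one group given a deliberate +0.3 bias:

```python
# Groupwise transport test: per-group z-tests of E[residual | group] = 0,
# then Benjamini-Hochberg across groups.
import math
import random

random.seed(4)

def group_pvalues(residuals_by_group):
    """Two-sided z-test p-value per group (normal approximation)."""
    out = {}
    for g, res in residuals_by_group.items():
        n = len(res)
        mean = sum(res) / n
        var = sum((r - mean) ** 2 for r in res) / (n - 1)
        z = mean / math.sqrt(var / n)
        out[g] = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return out

def benjamini_hochberg(pvals, q=0.05):
    """Return the set of groups rejected at FDR level q."""
    ranked = sorted(pvals.items(), key=lambda kv: kv[1])
    m = len(ranked)
    k_max = 0
    for i, (_, p) in enumerate(ranked, start=1):
        if p <= q * i / m:
            k_max = i
    return {g for g, _ in ranked[:k_max]}

# Two well-calibrated groups, one with a systematic +0.3 residual bias
resid = {
    "legal":   [random.gauss(0.0, 0.2) for _ in range(200)],
    "medical": [random.gauss(0.0, 0.2) for _ in range(200)],
    "finance": [random.gauss(0.3, 0.2) for _ in range(200)],
}
failed = benjamini_hochberg(group_pvalues(resid))
assert "finance" in failed   # transport fails for the biased group
```

In practice the rejected groups feed directly into the decision rule: re-prompt if the residual pattern is interpretable, otherwise fit a group-specific calibrator.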
4.2 When transport fails
Option 1: Re-prompt (cheapest)
If residuals show a clear pattern (e.g., the new domain has different citation norms), update θ to address it. Re-score and re-calibrate on the combined data.
Option 2: Local calibration (Regime 2)
Collect oracle labels in the new environment. Fit an environment-specific calibrator f_e. Use f_e only within that environment.
Option 3: Fallback to direct Y estimation (Regime 1)
If surrogacy is weak everywhere, use S_θ only for efficiency (as covariates in the outcome model). This requires Y labels in every evaluation context. See Kallus & Mao (2020) [4].
5. Anti-Gaming & Adversarial Validation
If the generator (model being evaluated) optimizes against the judge, raw scores can be gamed. Calibration + residual monitoring mitigate but don't eliminate this risk.
5.1 Validation battery
Adversarial test suite (hold out from calibration)
Expected outcomes
- Uncalibrated S_θ: mean score increases by 0.10–0.30 under attacks
- Calibrated f(S_θ, X): shift ≤ 0.05 if the rubric has proper guards
- Failures surface as residual structure → trigger prompt updates
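A padding attack from the battery can be simulated directly; the verbosity-leaking judge and the length-aware calibrator below are toy stand-ins, with thresholds taken from the expected outcomes above:

```python
# Anti-gaming check sketch: compare mean raw vs calibrated scores on clean
# inputs and their padded twins. Toy judge leaks a verbosity bonus; the
# length-aware calibrator (fit on gold in a real system) strips it.
import random

random.seed(5)

def judge_raw(quality, n_tokens):
    """Toy judge with a verbosity leak: longer answers score higher."""
    return quality + 0.002 * n_tokens + random.gauss(0, 0.02)

def calibrate(s, n_tokens):
    """Toy length-aware calibrator: removes the verbosity term."""
    return s - 0.002 * n_tokens

clean = [(random.random(), 150) for _ in range(1000)]
padded = [(q, n + 150) for q, n in clean]   # same quality, double the length

def mean_scores(items):
    raw = [judge_raw(q, n) for q, n in items]
    cal = [calibrate(s, n) for s, (q, n) in zip(raw, items)]
    return sum(raw) / len(raw), sum(cal) / len(cal)

raw_c, cal_c = mean_scores(clean)
raw_p, cal_p = mean_scores(padded)

assert raw_p - raw_c > 0.10          # uncalibrated score is gameable
assert abs(cal_p - cal_c) <= 0.05    # calibrated score resists padding
```

When the calibrated shift exceeds the 0.05 tolerance, that is exactly the residual structure that should trigger a rubric update.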
6. Sample Size Planning & Budget Curves
6.1 Approximate formulas (DR estimator)
Under standard regularity, the variance of V̂_DR scales as:

Var(V̂_DR) ≈ σ²_eval / n + σ²_cal / m

where:
- σ²_eval / n: Variance from evaluation (DR influence function), with n the number of scored examples
- σ²_cal / m: Variance from calibration uncertainty, with m the number of oracle labels
Rule of thumb: If OUA share ≈ 0.2, then Var_cal ≈ 0.25 · Var_eval. To achieve SE ≤ 0.025 (95% CI half-width ≈ 0.05), you need:

σ²_eval / n + σ²_cal / m ≤ 0.025²

For typical tasks, this translates to a few thousand scored examples (n on the order of 3000) and a few hundred oracle labels (m on the order of 500).
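Under the additive variance model above, a budget planner is a few lines. The allocation rule below (splitting the variance budget by a target OUA share) is one reasonable choice, not the only one, and the variance inputs are illustrative:

```python
# Sample-size planner under Var ~ sigma2_eval/n + sigma2_cal/m, allotting
# an `oua_share` fraction of the variance budget to calibration uncertainty.
import math

def plan_sample_sizes(sigma2_eval, sigma2_cal, target_se, oua_share=0.2):
    """Smallest (n scored, m oracle) meeting the SE target under the split."""
    budget = target_se ** 2
    n = sigma2_eval / ((1 - oua_share) * budget)
    m = sigma2_cal / (oua_share * budget)
    return math.ceil(n), math.ceil(m)

# Hypothetical per-example variances; target SE 0.025 as in the rule of thumb
n, m = plan_sample_sizes(sigma2_eval=1.0, sigma2_cal=0.05, target_se=0.025)

# The planned sizes meet the variance budget
assert 1.0 / n + 0.05 / m <= 0.025 ** 2 + 1e-9
```

Sweeping `oua_share` traces out the budget curve: more oracle labels buy down Var_cal, more cheap scores buy down Var_eval.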
7. Limitations & Open Questions
7.1 Construct drift in Y itself
Our theory assumes (the target outcome) is stable. But user preferences evolve, tasks shift, and "what good looks like" changes over time. If the definition of drifts, no calibration can save you—you need to re-anchor with new reference policies or expert consensus. See Y* anchoring for versioning and stability checks.
7.2 Non-regular targets (max/min/quantile aggregators)
Our results assume and estimands are smooth enough for standard asymptotic theory. Extreme-value targets (e.g., worst-case performance, 99th percentile) violate regularity. Calibration can still help empirically, but theoretical guarantees require EVT-specific arguments.
7.3 Multimodal judges (beyond text)
We focus on text-based judges. Extending to vision, audio, or multimodal settings is straightforward mechanically (same calibration loop), but failure modes differ. E.g., vision judges may reward aesthetic style over correctness. Residual diagnostics adapt, but domain-specific anti-gaming tests are needed.
7.4 Optimizing against the judge (reward hacking)
If model training directly optimizes the calibrated reward f(S_θ, X), generators will eventually exploit any remaining gaps between f(S_θ, X) and Y. Our anti-gaming tests catch known attacks, but adversarial generators can discover novel exploits. Defense: treat f(S_θ, X) as a measurement upgrade, not the training objective; periodically validate with A/B tests on real Y; version θ and re-calibrate when drift is detected.
8. Conclusion
We have presented a formal framework for programmable surrogacy: LLM judges whose measurement function is endogenous and improvable through prompt engineering. Our closed-loop algorithm alternates between calibration (learning f) and residual-guided prompt updates (refining θ to reduce error). Under mild regularity, enriching the judge's evidence set weakly reduces prediction error (Lemma 1), and doubly-robust policy evaluation with calibrated rewards admits valid, oracle-uncertainty-aware inference (Proposition 2).
Key contributions:
- Theoretical: Information-ordering guarantee for prompt refinements; OUA variance decomposition for calibrated DR estimators; transport diagnostics for cross-environment validity
- Methodological: Closed-loop algorithm with stopping criteria; residual audit pipeline; anti-gaming test battery; sample-size planning with budget curves
- Practical: Drop-in templates for judge prompts, residual diagnostics, and governance checklists; reproducible experiments showing 20–40% RMSE reduction over two iterations
Implementation & Resources
For practical guidance, code examples, and step-by-step tutorials, see the companion resources:
Assumptions Ledger
| Code | Statement | Used by | Test/Diagnostic | Mitigation |
|---|---|---|---|---|
| J2 | Surrogacy sufficiency on support: E[Y \| X, A, S] = f(S, X) | All estimators | Incremental signal; residual vs. f | Richer judge; higher rung; add covariates |
| J3 | Transport (S-admissibility): Y ⊥ Sel \| X, A, S | Cross-env eval | Per-group residual test; cross-domain Prentice test | Re-prompt / local calibrator |
| Overlap | π ≪ π₀ | IPS/DR | ESS, weight tails | Collect draws; weight stabilization |
| OUA | Finite oracle labels → extra variance | Inference | OUA share | Add labels if OUA dominates |
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2025programmable,
author = {Landesberg, Eddie},
title = {Programmable Proxies: Designing, Calibrating, and Auditing LLM Judges as Surrogate Outcomes},
year = {2025},
month = {November},
url = {https://cimolabs.com/research/programmable-proxies-technical},
note = {CIMO Labs Technical Report}
}

Plain Text
Landesberg, E. (2025). Programmable Proxies: Designing, Calibrating, and Auditing LLM Judges as Surrogate Outcomes. CIMO Labs Technical Report. https://cimolabs.com/research/programmable-proxies-technical
References
[1] Blackwell, D. (1951). Comparison of Experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability.
[2] Dudík, M., Erhan, D., Langford, J., & Li, L. (2014). Doubly Robust Policy Evaluation and Optimization. Statistical Science, 29(4).
[3] Efron, B., & Tibshirani, R. J. (1994). An Introduction to the Bootstrap. Chapman & Hall/CRC.
[4] Kallus, N., & Mao, X. (2020). On the Role of Surrogates in the Efficient Estimation of Treatment Effects with Limited Outcome Data. arXiv preprint.
Acknowledgements
We thank the CIMO Labs team and community for feedback on this work, and the researchers whose foundational contributions to surrogate endpoint theory, off-policy evaluation, and information theory made this framework possible.
We welcome your feedback
This framework is actively evolving. We invite constructive criticism from practitioners and researchers.
If you spot errors, have theoretical extensions, or have applied programmable surrogacy in production and want to share lessons, please let us know or email eddie@cimolabs.com.
