
AI Quality as Surrogacy for Idealized Deliberation: Technical Appendix

Formal framework with precise definitions, identification results, influence functions, and asymptotic theory.

Prerequisites: This appendix assumes familiarity with semiparametric efficiency theory, influence functions, and causal inference. For the conceptual introduction, see the main post.

0. Notation and spaces

  • Context space $\mathcal{X}$, action space $\mathcal{A}$ (text, code, plans), score space $\mathcal{S} \subset \mathbb{R}^d$.
  • A policy $\pi$ maps $x \in \mathcal{X}$ to a distribution $\pi(\cdot \mid x)$ on $\mathcal{A}$. Let $\Pi$ be a class of admissible policies.
  • $X \sim P_X$ denotes the population distribution of contexts; we treat single-turn first and extend to trajectories in §10.
  • An Idealized Deliberation Oracle (IDO) is a functional $Y^*: \mathcal{X} \times \mathcal{A} \to [0,1]$ representing the normalized evaluation under idealized deliberation. See the utility semantics box below for a precise definition.
  • A judge (or surrogate measurement process) $J$ maps $(x,a)$ to a random score $S \in \mathcal{S}$. We allow a ladder of rungs
$$S^{(0)}, S^{(1)}, \ldots, S^{(K)} \qquad (\text{increasing effort } 0 < 1 < \cdots < K)$$

induced by a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_K$ with $S^{(k)}$ measurable w.r.t. $\mathcal{F}_k$.

Target quality

For any $\pi \in \Pi$,

$$V(\pi) := \mathbb{E}[Y^*(X, A_\pi(X))], \qquad A_\pi(X) \sim \pi(\cdot \mid X)$$

IDO semantics (utility view)

Fix an outcome space $\Omega$, a kernel $P(d\omega \mid x,a)$, a utility $U:\Omega\to\mathbb{R}^m$, an optional social aggregator $W:\mathbb{R}^m\to\mathbb{R}$, a risk/aggregation functional $F:\mathcal{P}(\mathbb{R})\to\mathbb{R}$ (e.g., mean, CVaR), and a strictly increasing normalization $N:\mathbb{R}\to[0,1]$.

$$Y^*(x,a) \;=\; N\!\Big(F\big(\mathsf{Law}[\,W(U(\omega))\,\mid X{=}x, A{=}a\,]\big)\Big)$$

Defaults: if single-stakeholder ($m=1$), $W$ is the identity; otherwise pick $W$ (e.g., weighted sum, max–min). $F$ defaults to the expectation and $N$ to reference-policy anchoring: $N(u) = \big(u - F(P_{U\mid \pi_{\text{low}}})\big) / \big(F(P_{U\mid \pi_{\text{high}}}) - F(P_{U\mid \pi_{\text{low}}})\big)$. Record $(U,F,N,W)$ in the assumptions ledger.
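To make the defaults concrete, here is a minimal Python sketch of the anchored normalization with $F$ = mean. The reference-utility samples, the anchor policies, and the clipping of out-of-range values to $[0,1]$ are illustrative assumptions, not part of the definition above.

```python
import numpy as np

def anchored_normalization(u_low, u_high):
    """Build N(u) anchored to reference policies (defaults above, F = mean).

    u_low, u_high: samples of the aggregated utility W(U(omega)) drawn under
    the anchor policies pi_low and pi_high (hypothetical reference policies).
    """
    f_low, f_high = float(np.mean(u_low)), float(np.mean(u_high))

    def N(u):
        # Affine map sending F(pi_low) -> 0 and F(pi_high) -> 1; values outside
        # the anchors are clipped, which sacrifices strict monotonicity there.
        return float(np.clip((u - f_low) / (f_high - f_low), 0.0, 1.0))

    return N

# Y*(x, a) = N(F(Law[W(U) | x, a])), approximated by a mean over outcome draws.
rng = np.random.default_rng(0)
N = anchored_normalization(rng.normal(0.2, 0.1, 1000), rng.normal(0.8, 0.1, 1000))
y_star = N(np.mean(rng.normal(0.6, 0.1, 200)))  # one (x, a) pair
```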

1. Axioms for the IDO (normative)

Let $Y^*(x,a)$ be the limiting value of a deliberation procedure.

  • A1 (Deliberative stability). There exists a sequence of increasing-effort labels $Y^{(k)}(x,a)$ s.t. $Y^{(k)}(x,a) \to Y^*(x,a)$ in $L^2$ as $k \to \infty$.
  • A2 (Evidence monotonicity). If $\mathcal{F}_k \subseteq \mathcal{F}_{k'}$, then
    $$\mathbb{E}[(Y^* - \mathbb{E}[Y^* \mid \mathcal{F}_{k'}])^2] \le \mathbb{E}[(Y^* - \mathbb{E}[Y^* \mid \mathcal{F}_k])^2]$$
  • A3 (Instrumental invariance). If two procedures yield the same world-state relevant to the objective, they have equal $Y^*$.

A1–A3 make $Y^*$ a well-defined limit of a "deliberation ladder."

Note. If $(U,F,N,W)$ change across environments, selection enters $Y^*$ and the calibration $f_k$ will not transport (§3.5 table, row "$S \to Y^*$"). Record $(U,F,N,W)$ in the assumptions ledger (§12).

2. Surrogacy (structural) and transport (stability) assumptions

  • S1 (IDO-surrogate sufficiency at rung $k$). There exists a measurable $f_k: \mathcal{S} \times \mathcal{X} \to [0,1]$ s.t.
    $$\mathbb{E}[Y^* \mid X, A, S^{(k)}] = f_k(S^{(k)}, X) \qquad \text{a.s.}$$
    (Optionally add monotonicity in a one-dimensional risk index $T = g_k(S^{(k)}, X)$.)
  • S2 (Transportability across policies/time). For a collection $\mathcal{G}$ of environments (policies, cohorts, time), the same $f_k$ works: for all $g \in \mathcal{G}$,
    $$\mathbb{E}_g[Y^* \mid X, A, S^{(k)}] = f_k(S^{(k)}, X)$$
    Graphical test (Pearl & Bareinboim, 2014[7]): In a selection diagram modeling environment differences via selection nodes, S2 holds if $S^{(k)}$ is S-admissible: $Y^* \perp\!\!\!\perp S \mid X, A, S^{(k)}$ in the diagram with incoming arrows to $A$ removed, where $S$ represents selection nodes. Intuitively: calibration transports if no selection node points into $Y^*$ given the surrogate.
  • S3 (Positivity/overlap for off-policy re-use). If estimating $V(\pi)$ from logs of $\pi_0$, then $\pi(a \mid x) > 0 \Rightarrow \pi_0(a \mid x) > 0$ a.s.
  • S4 (Judge availability). For any $\pi$ used in Direct mode, we can obtain $S^{(k)}(X, A_\pi(X))$ at scale; for OPE/DR, we have $S^{(k)}(X, A_{\pi_0}(X))$ in logs.
  • L1 (Oracle MAR). Let $L \in \{0,1\}$ indicate whether an example received an oracle label $Y^*$. Then $L \perp Y^* \mid (X, A, S^{(k)})$. Oracle labeling is ignorable conditional on observed surrogates and covariates.
  • L2 (Oracle positivity). $P(L=1 \mid X, A, S^{(k)}) > 0$ on the support where $f_k$ will be applied. Ensures the calibration function is identifiable and transportable.

Relationship to Kallus & Mao (2024)[1]

Kallus & Mao study a related but distinct problem: estimating treatment effects on a primary outcome $Y$ when labels are sparse but surrogates $S$ are abundant. Their framework does not assume surrogacy sufficiency (our S1)—surrogates improve precision without replacing $Y$. They derive semiparametrically efficient doubly robust estimators under unconfoundedness and MAR labeling, using cross-fitting to handle flexible nuisance estimation.

In contrast, our IDO framework treats the calibrated $R^{(k)}$ as the target outcome scale (§0), relying on S1–S2 for identification. The key practical distinction is transportability:

  • IDO with S1–S2: The calibration function $f_k$ transports across datasets/policies (S2). Calibrate once on an oracle slice → evaluate many new policies on different data using only $S^{(k)}$. Example: calibrate on 1000 Arena prompts with GPT-5 labels, then evaluate 100 policies on 10,000 new prompts with only GPT-4.1-nano scores. Separation of calibration and evaluation is the key efficiency gain.
  • K&M without S1: No transport assumption. Calibration and estimation must happen on the same dataset. Every new evaluation context requires $Y$ labels for some units in that context. More robust to S1/S2 failures, but you can't amortize calibration investment across contexts.
  • OUA variance: Our jackknife (§5.3) accounts for uncertainty in learning $f_k$ under S1–S2. K&M's EIF includes analogous calibration uncertainty via cross-fitting, but without assuming the calibration transports.

The frameworks are complementary: use IDO+surrogacy when S2 (transport) is credible and you want to amortize calibration across many evaluation contexts; use K&M-style estimation when transport is suspect and you're willing to collect $Y$ labels in each new evaluation dataset. Our transport test (§6) is critical for detecting S2 violations.

3. Identification

Let $R^{(k)} = f_k(S^{(k)}, X)$ be the calibrated reward on the IDO scale.

Proposition 1 (Direct identification)

Under S1 (and S2 + L1–L2 if $f_k$ learned out-of-domain),

$$V(\pi) = \mathbb{E}[R^{(k)}_\pi], \qquad R^{(k)}_\pi := f_k(S^{(k)}(X, A_\pi(X)), X)$$

Proof sketch. S1 gives $\mathbb{E}[Y^* \mid X, A, S^{(k)}] = R^{(k)}$, hence $\mathbb{E}[Y^* \mid X, A_\pi(X)] = \mathbb{E}[R^{(k)}_\pi \mid X]$; take expectations over $X$. S2 + L1–L2 are needed only when the calibration is transported from a different distribution.

Proposition 2 (IPS identification)

Under S1, S3 (and S2 + L1–L2 if $f_k$ learned out-of-domain), from logs $(X, A, S^{(k)}) \sim \pi_0$,

$$V(\pi) = \mathbb{E}[w_\pi(X, A)\, R^{(k)}], \qquad w_\pi(X, A) := \frac{\pi(A \mid X)}{\pi_0(A \mid X)}$$

Proposition 3 (DR identification)

Under S1, S3 (and S2 + L1–L2 if $f_k$ learned out-of-domain). Let $Q_\eta(x,a) := \mathbb{E}[R^{(k)} \mid X=x, A=a]$ be any outcome model ("critic"). Then

$$V(\pi) = \mathbb{E}\big[w_\pi(X,A)\,(R^{(k)} - Q_\eta(X,A)) + Q_\eta^\pi(X)\big]$$

where $Q_\eta^\pi(X) := \mathbb{E}_{a \sim \pi(\cdot \mid X)}[Q_\eta(X,a)]$.

The identity remains valid if either $w_\pi$ or $Q_\eta$ is misspecified, provided the other is correct (double robustness).
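A quick Monte Carlo check of Proposition 3's double robustness (a sketch only; the binary action space, the logging and target policies, and the deliberately misspecified critic are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.normal(size=n)

# Hypothetical logging policy pi_0 and fixed target policy pi over A in {0, 1}.
p0 = 1.0 / (1.0 + np.exp(-X))          # pi_0(A=1 | X)
p1 = np.full(n, 0.7)                   # pi(A=1 | X)
A = (rng.random(n) < p0).astype(float)

Q_true = lambda x, a: 1.0 / (1.0 + np.exp(-(x + a)))   # true critic
R = Q_true(X, A) + 0.05 * rng.normal(size=n)           # calibrated reward R^(k)

w = np.where(A == 1, p1 / p0, (1 - p1) / (1 - p0))     # importance weights

Q_bad = lambda x, a: np.full_like(x, 0.5)              # misspecified critic
Qpi_bad = p1 * Q_bad(X, 1) + (1 - p1) * Q_bad(X, 0)

V_true = np.mean(p1 * Q_true(X, 1) + (1 - p1) * Q_true(X, 0))
V_dr = np.mean(w * (R - Q_bad(X, A)) + Qpi_bad)  # bad critic, correct weights
print(V_true, V_dr)  # agree up to Monte Carlo error: the DR identity holds
```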

3.5. Transport formulas (cross-environment evaluation)

When evaluating $\pi$ in a target environment that differs from the calibration source, Pearl & Bareinboim's transport framework [7] tells us exactly which target quantities to measure. Below are the three common deployment scenarios:

Case A: Covariate shift only (selection into X)

Scenario: Prompt distribution changes (new user population, different time period), but the judge mechanism $P(S^{(k)} \mid X,A)$ and oracle meaning are invariant.

Transport formula:

$$V_*(\pi) = \mathbb{E}_{X \sim P_*}\Big[\mathbb{E}_{a \sim \pi(\cdot \mid X)}\big[Q(X,a)\big]\Big], \qquad Q(X,a) := \mathbb{E}_P\big[f_k(S^{(k)}, X) \mid X, a\big]$$

What you need in target: $P_*(X)$ (ability to draw prompts from the target population). You can keep $f_k$ and $Q(X,a)$ trained on source data.

Case B: Judge/measurement shift (selection into S^(k))

Scenario: Judge model changes (GPT-4.1-nano → GPT-4.5-nano), instrumentation updates, or deliberation depth increases, but prompt distribution and oracle meaning are invariant.

Transport formula:

$$V_*(\pi) = \mathbb{E}_{X \sim P}\Big[\mathbb{E}_{a \sim \pi(\cdot \mid X)}\big[\mathbb{E}_{S^{(k)} \sim P_*(\cdot \mid X,a)}[f_k(S^{(k)}, X)]\big]\Big]$$

What you need in target: $P_*(S^{(k)} \mid X,a)$ (new judge channel). You can keep $f_k$ if S-admissibility holds (no selection into $Y^*$). If prompts also shift, replace $P$ by $P_*$ in the outer expectation (i.e., use Case C).

Case C: Covariate + judge shift (selection into X and S^(k))

Scenario: Both prompt distribution and judge mechanism change (e.g., deploying to new geography with different user base and updated judge model).

Transport formula:

$$V_*(\pi) = \mathbb{E}_{X \sim P_*}\Big[\mathbb{E}_{a \sim \pi(\cdot \mid X)}\big[\mathbb{E}_{S^{(k)} \sim P_*(\cdot \mid X,a)}[f_k(S^{(k)}, X)]\big]\Big]$$

What you need in target: both $P_*(X)$ and $P_*(S^{(k)} \mid X,a)$.
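A plug-in sketch of Case A (the simplest scenario). The helpers `draw_target_prompts`, `policy_sample`, and `q_source` are hypothetical; `q_source` stands for the critic $Q(X,a)$ fit on source data and reused unchanged:

```python
import numpy as np

def transported_value_case_a(draw_target_prompts, policy_sample, q_source,
                             m=5000, draws_per_x=8):
    """Case A: V_*(pi) = E_{X ~ P_*}[ E_{a ~ pi(.|X)}[ Q(X, a) ] ].

    draw_target_prompts(m) -> m prompts from the target population P_*(X).
    policy_sample(x, k)    -> k actions drawn from pi(. | x).
    q_source(x, a)         -> E_P[f_k(S^(k), X) | X=x, A=a], fit on source data.
    """
    values = []
    for x in draw_target_prompts(m):
        actions = policy_sample(x, draws_per_x)
        values.append(np.mean([q_source(x, a) for a in actions]))
    return float(np.mean(values))
```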

When transport fails: Selection into Y*

If selection points into $Y^*$ (oracle meaning changed—e.g., safety standards shifted, evaluation criteria evolved), S-admissibility is violated and $f_k$ does not transport. You must recalibrate $f_k$ with new oracle labels in the target environment, or adopt the Kallus & Mao estimator (§4.6), which targets $Y$ directly per context without assuming transport.

Selection node location | $f_k$ transports? | Required target measurements | Source pieces you keep
$S \to X$ only | Yes | $P_*(X)$ | $f_k$, $Q(X,a)$
$S \to S^{(k)}$ only | Yes | $P_*(S^{(k)} \mid X,a)$ | $f_k$
$S \to X, S^{(k)}$ | Yes | $P_*(X)$, $P_*(S^{(k)} \mid X,a)$ | $f_k$
$S \to Y^*$ | No | New oracle labels to recalibrate | (none)

4. Estimators

Let $\mathcal{I}_{\text{oracle}} \subset \{1, \ldots, n\}$ index examples with expensive IDO labels $Y^*$ (at a top rung one can afford); others have only $S^{(k)}$.

4.1. Calibrator

Estimate $f_k$ on $\mathcal{I}_{\text{oracle}}$ by:

  • Monotone (isotonic): $f_k(s,x) \equiv f_k(s)$ nondecreasing in $s$ and mean-preserving on the oracle slice.
  • Two-stage: Fit $T = g_k(S^{(k)}, X)$ (e.g., a spline in $(S^{(k)}, \text{length})$), then isotonic $T \mapsto Y^*$ with mean preservation.

Note on mean preservation: Mean preservation holds on the calibration slice; after transport to new domains/policies, the mean can differ unless S2 (transport) and L1–L2 (oracle MAR/positivity) hold. Use the transport test (§6) to validate.

Use K-fold cross-fitting: train $f_k$ on folds $\neq j$, predict on fold $j$, to obtain out-of-fold $\widehat{R}^{(k)}$.
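A minimal sketch of the monotone option with K-fold cross-fitting, using scikit-learn's isotonic regression on a scalar score (the fold count and out-of-bounds clipping are illustrative choices):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import KFold

def cross_fit_calibrator(s_oracle, y_oracle, n_folds=5, seed=0):
    """Cross-fit a monotone calibrator f_k on the oracle slice.

    s_oracle: judge scores S^(k) for oracle-labeled examples.
    y_oracle: oracle labels Y* in [0, 1].
    Returns out-of-fold calibrated rewards and the per-fold calibrators
    (the latter are reused by the OUA jackknife in §5.3).
    """
    s = np.asarray(s_oracle, float)
    y = np.asarray(y_oracle, float)
    r_oof = np.empty_like(y)
    models = []
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(s):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        iso.fit(s[train], y[train])  # PAV is mean-preserving on its training fold
        r_oof[test] = iso.predict(s[test])
        models.append(iso)
    return r_oof, models
```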

4.2. Direct (fresh draws)

With $m$ prompts scored under $\pi$,

$$\widehat{V}_{\text{dir}}(\pi) = \frac{1}{m} \sum_{i=1}^m \widehat{R}^{(k)}_{\pi,i}$$

4.3. IPS (logs only)

$$\widehat{V}_{\text{IPS}}(\pi) = \frac{\sum_{i=1}^n w_{\pi,i} \widehat{R}^{(k)}_i}{\sum_{i=1}^n w_{\pi,i}} \quad \text{(self-normalized)}$$

4.4. DR (logs + critic ± fresh draws)

Fit $Q_\eta(x,a) \approx \mathbb{E}[\widehat{R}^{(k)} \mid x,a]$ via cross-fitting. If fresh draws from $\pi$ are available, approximate $Q_\eta^\pi(x)$ by Monte Carlo. Then

$$\widehat{V}_{\text{DR}}(\pi) = \frac{1}{n} \sum_{i=1}^n \Big[ w_{\pi,i}\big(\widehat{R}^{(k)}_i - Q_\eta(X_i, A_i)\big) + Q_\eta^\pi(X_i) \Big]$$
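Minimal sketches of the §§4.2–4.4 estimators, assuming the out-of-fold calibrated rewards, importance weights, and cross-fit critic values have been computed upstream:

```python
import numpy as np

def v_direct(r_pi):
    """Direct (§4.2): mean calibrated reward over fresh prompts scored under pi."""
    return float(np.mean(r_pi))

def v_snips(w, r):
    """Self-normalized IPS (§4.3) over logged data from pi_0."""
    w, r = np.asarray(w, float), np.asarray(r, float)
    return float(np.sum(w * r) / np.sum(w))

def v_dr(w, r, q_logged, q_pi):
    """DR (§4.4): mean of w_i (R_i - Q(X_i, A_i)) + Q^pi(X_i).

    q_logged: cross-fit critic evaluated at the logged (X_i, A_i).
    q_pi: Monte Carlo estimate of E_{a ~ pi(.|X_i)}[Q(X_i, a)] from fresh draws.
    """
    w, r = np.asarray(w, float), np.asarray(r, float)
    return float(np.mean(w * (r - np.asarray(q_logged, float))
                         + np.asarray(q_pi, float)))
```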

4.5. Weight stabilization (optional, off-policy)

Project raw weights to a mean-one, score-indexed monotone cone (SIM-style calibration) to boost ESS. This is a bias–variance tradeoff: stabilized weights $\tilde{w}_{\pi,i}$ can introduce small bias unless they converge to the true importance ratio. Use weight stabilization inside DR estimators (where outcome models guard against modest weight misspecification), and report diagnostics (ESS, tails).
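One plausible instantiation (a sketch only, not necessarily the exact SIM procedure): fit a monotone map from a one-dimensional score index to the raw weights, then rescale to mean one:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def stabilize_weights(w_raw, score_index):
    """Project raw weights onto a monotone function of a score index,
    then rescale to mean one (one reading of 'SIM-style calibration')."""
    iso = IsotonicRegression(out_of_bounds="clip")
    w_tilde = iso.fit(score_index, w_raw).predict(score_index)
    return w_tilde / np.mean(w_tilde)
```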

4.6. K–M drop-in estimator (when transport is doubtful)

If S2 (transport) or L1–L2 (oracle MAR/positivity) are suspect, use a Kallus & Mao–style estimator that does not assume surrogacy sufficiency. This requires collecting $Y$ labels in each evaluation context, but is robust to calibration transport failures.

Setup for two-policy comparisons (ATE) on a fixed prompt set:

  • Let $T \in \{0,1\}$ denote the policy indicator (e.g., baseline vs. candidate); $X$ = prompts/contexts, $S$ = cheap judge scores, $Y$ = oracle outcome
  • Observe $(X,S)$ on all examples; collect $Y$ on an MAR-selected subset ($R=1$)
  • Nuisances (cross-fit): propensity $e(X) = P(T=1 \mid X)$, labeling propensity $r(t,X,S)$ (or density ratio $\lambda(S,X,t)$ via offset logistic if labels very sparse), $\tilde{\mu}(t,X,S) = \mathbb{E}[Y \mid T=t, X, S, R=1]$, and its projection $\mu(t,X) = \mathbb{E}[\tilde{\mu}(t,X,S) \mid T=t, X]$

Estimator: Sample-average of the K&M EIF (see comparison box in §5):

$$\widehat{\delta} = \frac{1}{n}\sum_{i=1}^n \Big[ \mu(1,X_i) - \mu(0,X_i) + \frac{T_i - e(X_i)}{e(X_i)(1-e(X_i))}\big(\tilde{\mu}(T_i,X_i,S_i) - \mu(T_i,X_i)\big) + \frac{T_i R_i}{e(X_i)\, r(1,X_i,S_i)}\big(Y_i - \tilde{\mu}(1,X_i,S_i)\big) - \frac{(1-T_i) R_i}{(1-e(X_i))\, r(0,X_i,S_i)}\big(Y_i - \tilde{\mu}(0,X_i,S_i)\big) \Big]$$

Rate: $N^{-1/2}$ if the label fraction is bounded away from 0; $N_l^{-1/2}$ in the very-sparse-labels regime. Doubly robust: consistent if either $(e, r)$ or $(\tilde{\mu}, \mu)$ are correct. Use cross-fitting and the empirical IF variance for valid inference.
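A sketch of the sample-average estimator above, assuming all nuisances are cross-fit upstream and passed in as arrays (shapes and names are illustrative):

```python
import numpy as np

def km_ate(T, R, Y, e, r, mu_tilde, mu):
    """Sample average of the K&M EIF displayed above, with its empirical-IF SE.

    T: (n,) 0/1 policy indicator; R: (n,) 0/1 label indicator.
    Y: (n,) oracle labels (entries with R == 0 may be arbitrary; zero weight).
    e: (n,) propensity P(T=1 | X); r: (n, 2) labeling propensities r(t, X, S).
    mu_tilde: (n, 2) E[Y | T=t, X, S, R=1]; mu: (n, 2) projection mu(t, X).
    """
    T = np.asarray(T, int)
    R = np.asarray(R, float)
    Y = np.where(R == 1, np.asarray(Y, float), 0.0)  # unlabeled entries zeroed
    idx = np.arange(len(T))
    proj = mu[:, 1] - mu[:, 0]
    gap = (T - e) / (e * (1 - e)) * (mu_tilde[idx, T] - mu[idx, T])
    miss = (T * R / (e * r[:, 1])) * (Y - mu_tilde[:, 1]) \
         - ((1 - T) * R / ((1 - e) * r[:, 0])) * (Y - mu_tilde[:, 0])
    psi = proj + gap + miss
    return float(psi.mean()), float(psi.std(ddof=1) / np.sqrt(len(T)))
```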

5. Influence functions and inference

Assume pathwise differentiability and regularity (bounded moments, entropy conditions satisfied via cross-fitting).

5.1. Efficient influence function (EIF) for V(π)

Under S1 and known $f_k$,

$$\phi_\pi(Z) = R^{(k)}_\pi - V(\pi), \qquad Z = (X, S^{(k)}, A \text{ if needed})$$

With DR structure and nuisances $\eta = (Q_\eta, w_\pi)$,

$$\phi_\pi(Z) = w_\pi(X,A)\big(R^{(k)} - Q_\eta(X,A)\big) + Q_\eta^\pi(X) - V(\pi)$$

which is Neyman-orthogonal to first-order perturbations of $(Q_\eta, w_\pi)$ holding $f_k$ fixed. Uncertainty from learning $f_k$ on the oracle slice is added separately via OUA (§5.3). If desired, one can treat $f_k$ as a nuisance and cross-fit it jointly to achieve formal orthogonality; we separate it and account for its uncertainty via OUA for transparency and modularity.

5.2. Asymptotics and SEs

With K-fold cross-fitting,

$$\sqrt{n}\,\big(\widehat{V}(\pi) - V(\pi)\big) \rightsquigarrow \mathcal{N}\big(0, \mathbb{V}[\phi_\pi(Z)]\big)$$

Estimate the variance with the empirical variance of $\phi_\pi$ (cluster-robust if needed).

5.3. Oracle-uncertainty aware (OUA) variance

If $f_k$ is learned from a finite oracle slice, add a delete-one-fold jackknife over oracle folds:

$$\widehat{\text{Var}}_{\text{OUA}} = \frac{K-1}{K} \sum_{j=1}^K \big(\widehat{V}^{(-j)}(\pi) - \bar{V}\big)^2, \qquad \bar{V} = \frac{1}{K} \sum_j \widehat{V}^{(-j)}$$

Total variance: $\widehat{\text{Var}}_{\text{main}} + \widehat{\text{Var}}_{\text{OUA}}$. Use Satterthwaite df for small-sample t-intervals if desired.
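A minimal sketch of the jackknife, assuming $\widehat{V}^{(-j)}(\pi)$ has been recomputed once per oracle fold with the calibrator refit on the remaining folds:

```python
import numpy as np

def oua_variance(v_loo):
    """Delete-one-fold jackknife over the K oracle folds (§5.3).

    v_loo: length-K array; v_loo[j] is V-hat(pi) recomputed with f_k refit
    on the oracle folds excluding fold j.
    """
    v = np.asarray(v_loo, float)
    K = len(v)
    return float((K - 1) / K * np.sum((v - v.mean()) ** 2))

# Total variance for the CI: empirical IF variance plus the OUA term, e.g.
#   var_total = phi.var(ddof=1) / n + oua_variance(v_loo)
```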

Kallus & Mao EIF (without S1): for comparison

K&M estimate the ATE on $Y$ (not on the calibrated $R^{(k)}$), using surrogates $S$ to improve efficiency without assuming surrogacy sufficiency. Their setup:

  • Treatment $T \in \{0,1\}$, outcome $Y$, surrogate $S$, covariates $X$
  • Oracle label indicator $R \in \{0,1\}$ with propensity $r(t,X,S) = P(R=1 \mid T=t, X, S)$ (MAR)
  • Nuisances: $e(X) = P(T=1 \mid X)$, $\tilde{\mu}(t,X,S) = \mathbb{E}[Y \mid T=t, X, S, R=1]$, and its projection $\mu(t,X) = \mathbb{E}[\tilde{\mu}(t,X,S) \mid T=t, X]$

Regime 1: Balanced labels ($N_l \asymp N_u$). The EIF for $\delta = \mathbb{E}[Y(1) - Y(0)]$ is:

$$\psi(W) = \underbrace{\mu(1,X) - \mu(0,X) - \delta}_{\text{projection term}} + \underbrace{\frac{T - e(X)}{e(X)(1-e(X))}\big(\tilde{\mu}(T,X,S) - \mu(T,X)\big)}_{\text{surrogate gap}} + \underbrace{\frac{TR}{e(X)\,r(1,X,S)}\big(Y - \tilde{\mu}(1,X,S)\big) - \frac{(1-T)R}{(1-e(X))\,r(0,X,S)}\big(Y - \tilde{\mu}(0,X,S)\big)}_{\text{missing-}Y}$$

This is doubly robust: consistent if either $(e, r)$ are correct or $(\tilde{\mu}, \mu)$ are correct. Cross-fitting ensures valid inference with flexible ML nuisances.

Regime 2: Very sparse labels ($N_l \ll N_u$). Replace $1/r(t,X,S)$ with a density ratio $\lambda(S,X,t)/\pi_t$ (estimated via offset logistic regression). The rate becomes $N_l^{-1/2}$, showing efficiency gains from unlabeled surrogates.

Key difference from IDO: K&M target $Y$ and do not assume S1, so they cannot "calibrate once, evaluate everywhere." To run K&M in your stack, collect $Y$ labels in each evaluation context and estimate nuisances jointly.

6. Testable diagnostics (falsifiable implications)

  • Transport test (policy/time). Per-group residual mean test:
    $$H_0: \mathbb{E}[Y^* - f_k(S^{(k)}, X) \mid G=g] = 0 \quad \forall g \in \mathcal{G}$$
    where $G$ indexes groups (policies, time periods, domains). Use the labeled subset; apply a multiple-testing correction (e.g., Bonferroni). This is a weaker, testable implication of S-admissibility—if you lack labels in multiple domains, you can only partially test S2.
  • Coverage of surrogate support. Compare histograms of $S^{(k)}$ on oracle-labeled vs. full sets; flag extrapolation if tails are unlabeled.
  • Overlap diagnostics (off-policy). Effective sample size $\text{ESS} = (\sum w)^2 / \sum w^2$, weight CV, max/median ratio, Hill tail index.
  • OUA share. Report $\widehat{\text{Var}}_{\text{OUA}} / (\widehat{\text{Var}}_{\text{main}} + \widehat{\text{Var}}_{\text{OUA}})$ to guide budget (more labels vs. more prompts).
  • Prentice test (surrogacy sufficiency / S1). On oracle-labeled subsets, regress $Y^*$ on $(X, A, S^{(k)})$ and test whether adding $A$ (and $A \times S^{(k)}$) improves fit. Failing to reject supports S1 (surrogacy sufficiency). For S-admissibility (S2, cross-domain), use a domain indicator $G$ and test $Y^* \perp\!\!\!\perp G \mid X, A, S^{(k)}$ on pooled labeled data across domains: does $G$ (and $G \times S^{(k)}$) improve prediction? If yes, $f_k$ does not transport—recalibrate or use the K&M estimator (§4.6). Sketches of the transport test and overlap diagnostics follow this list.
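Sketches of the transport test and the overlap diagnostics (the Bonferroni-corrected one-sample t-test is one reasonable implementation choice, not prescribed by the text):

```python
import numpy as np
from scipy import stats

def transport_test(resid, groups, alpha=0.05):
    """Per-group residual mean test of H0: E[Y* - f_k(S^(k), X) | G=g] = 0."""
    resid, groups = np.asarray(resid, float), np.asarray(groups)
    gs = np.unique(groups)
    report = {}
    for g in gs:
        rg = resid[groups == g]
        t, p = stats.ttest_1samp(rg, 0.0)
        report[g] = {"mean_resid": float(rg.mean()), "p": float(p),
                     "reject": bool(p < alpha / len(gs))}  # Bonferroni
    return report

def overlap_diagnostics(w):
    """ESS, weight CV, and max/median ratio for importance weights."""
    w = np.asarray(w, float)
    return {"ess": float(np.sum(w) ** 2 / np.sum(w ** 2)),
            "cv": float(np.std(w) / np.mean(w)),
            "max_over_median": float(np.max(w) / np.median(w))}
```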

7. Learning with the IDO objective

For parametric $\pi_\theta$, the policy learning problem is

$$\max_{\theta \in \Theta} V(\pi_\theta)$$

A plug-in gradient follows from the policy gradient identity with calibrated rewards:

$$\nabla_\theta V(\pi_\theta) = \mathbb{E}\Big[\mathbb{E}_{a \sim \pi_\theta(\cdot \mid X)}\big[\nabla_\theta \log \pi_\theta(a \mid X)\, R^{(k)}(X,a)\big]\Big]$$

optionally replacing $R^{(k)}$ by an advantage $R^{(k)} - b(X)$. This "RL with calibrated reward" aligns training with IDO (a sketch follows below).

For safe deployment, maximize a lower confidence bound $\widehat{V}(\pi_\theta) - z_{1-\alpha} \cdot \text{SE}(\widehat{V}(\pi_\theta))$.
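A REINFORCE-style sketch of the plug-in gradient with an optional baseline; `sample_action`, `logpi_grad`, `r_k`, and `baseline` are hypothetical callables supplied by the surrounding stack:

```python
import numpy as np

def pg_step(theta, contexts, sample_action, logpi_grad, r_k,
            lr=0.01, baseline=None):
    """One ascent step on V(pi_theta) using the score-function identity above.

    sample_action(theta, x) -> a ~ pi_theta(. | x)
    logpi_grad(theta, x, a) -> grad_theta log pi_theta(a | x)
    r_k(x, a)               -> calibrated reward f_k(S^(k)(x, a), x)
    baseline(x)             -> optional b(x) for variance reduction
    """
    grad = np.zeros_like(theta)
    for x in contexts:
        a = sample_action(theta, x)
        adv = r_k(x, a) - (baseline(x) if baseline is not None else 0.0)
        grad += logpi_grad(theta, x, a) * adv
    return theta + lr * grad / len(contexts)
```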

8. Multiple stakeholders and social choice

Let $\mathcal{U} = \{1, \ldots, m\}$ index stakeholders with oracles $Y^*_u(x,a) \in [0,1]$. A social aggregator $W: [0,1]^m \to [0,1]$ defines

$$V(\pi) = \mathbb{E}\big[W\big(Y^*_1(X, A_\pi(X)), \ldots, Y^*_m(X, A_\pi(X))\big)\big]$$

Common choices: weighted utilitarian ($W(y) = \sum_u w_u y_u$), max-min ($W(y) = \min_u y_u$), or constrained variants. Surrogacy extends with $f_{k,u}(S^{(k)}, X) \approx \mathbb{E}[Y^*_u \mid X, A, S^{(k)}]$; calibrate each and plug into $W$.

9. The deliberation ladder as information order

Model rungs by a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_K \subseteq \mathcal{F}_\infty$. Define

$$Y^{(k)}(x,a) := \mathbb{E}[Y^*(x,a) \mid \mathcal{F}_k], \qquad S^{(k)} = \text{any statistic measurable w.r.t. } \mathcal{F}_k$$

Then by Blackwell/Doob ordering, $k' \ge k$ implies $\mathbb{E}[(Y^* - Y^{(k')})^2] \le \mathbb{E}[(Y^* - Y^{(k)})^2]$. If $S^{(k)}$ is Blackwell more informative than $S^{(k-1)}$, a calibrated estimator at rung $k$ is (weakly) more efficient than at rung $k-1$.

10. Extension to trajectories (agents)

Let a trajectory $\tau = (s_0, a_0, \ldots, s_T)$ evolve under policy $\pi$ and environment $P$. Define an IDO trajectory value

$$Y^*(\tau) \in [0,1] \quad \text{or} \quad Y^*(\pi; X) = \mathbb{E}_{P,\pi}\Big[\sum_{t=0}^T \gamma^t u^*(s_t, a_t) \,\Big|\, X\Big]$$

Surrogates may be terminal ($S_T^{(k)}$) or stepwise ($S_t^{(k)}$). Direct/IPS/DR estimators extend with clustering by trajectory; sequential IPS is typically ill-conditioned, so prefer Direct or DR with trajectory-level critics.

11. Limits (scope conditions)

  • Non-regular targets. If $Y^*$ or $W$ induces non-differentiable functionals (e.g., maxima, boundary problems), first-order theory fails; use selective/subsampling or shape-constrained methods.
  • Severe non-transport. If S2 fails (e.g., adversarial policy styles), base-only calibration is biased; require per-policy calibration or new oracle labels.
  • Overlap failures. If S3 fails, IPS/DR is unreliable even with stabilized weights; collect fresh draws and use Direct.

12. Minimal "assumptions ledger" (for every deployment)

Code | Statement | Used by | Test / Diagnostic | Mitigation
S1 | $\exists f_k: \mathbb{E}[Y^* \mid X,A,S^{(k)}] = f_k(S^{(k)},X)$ | All | Incremental signal; residual vs. $f_k$ | Add covariates; richer judge; higher rung
S2 | $Y^* \perp\!\!\!\perp S \mid X,A,S^{(k)}$ (S-admissibility); $f_k$ transports when no selection node points into $Y^*$ | All (cross-environment) | Per-group residual test (§6); cross-domain Prentice test with $G$ indicator; diagram review (§3.5) | If selection into $X$ or $S^{(k)}$: measure target distributions (§3.5 table). If selection into $Y^*$: recalibrate with target oracle labels
S3 | $\pi \ll \pi_0$ (overlap) | IPS/DR | ESS, tail index, max/median | Weight stabilization; collect draws
A1–A3 | IDO well-posed | All | Rung stability checks | Clarify oracle definition; adjust $W$
L1 | $L \perp Y^* \mid (X,A,S^{(k)})$ (oracle MAR) | All (calibration) | Oracle selection independent of residuals | Randomize oracle sampling; stratify by $S, X$
L2 | $P(L=1 \mid X,A,S^{(k)}) > 0$ (oracle positivity) | All (calibration) | Coverage plots; extrapolation warnings | Label tail regions; flag OOD predictions
OUA | Finite oracle labels | Inference | OUA share | Add labels if OUA dominates
N | Strictly increasing normalization to $[0,1]$; anchored to $(\pi_{\text{low}}, \pi_{\text{high}})$ (or specified benchmarks) | All (comparability & reporting) | Anchor stability check across releases; report raw $F$ and anchored $Y^*$ when anchors change | Re-anchor or freeze anchors; append change log when re-anchoring

13. What you report (template)

For each $\pi$:

  • $\widehat{V}(\pi)$ on the IDO scale with 95% CI (main + OUA), and DF rule.
  • Diagnostics: transport test p-values, ESS (if OPE/DR), OUA share, oracle coverage plots.
  • If choosing a policy: a decision with one-sided CI (safety margin).

Summary

  • Definition: $V(\pi) = \mathbb{E}[Y^*(X, A_\pi(X))]$
  • Mechanism: use surrogates $S^{(k)}$ and a calibration $f_k$ so that $\mathbb{E}[Y^* \mid X,A,S^{(k)}] = f_k(S^{(k)},X)$
  • Identification: Direct (fresh draws), IPS (reweight logs), DR (two chances)
  • Uncertainty: influence-function variance + oracle-learning variance (OUA)
  • Governance: multi-party $W$ encodes whose IDO matters and how

This turns "AI should do what you'd do with unlimited time" into a measurable target, with estimators, CIs, and failure tests you can run.

References

[1] Kallus, N., & Mao, X. (2024). On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv:2003.12408. Semiparametric efficiency theory for surrogate-assisted treatment effect estimation under MAR assumptions.
[2] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. Foundational DML framework for valid inference with cross-fitting and Neyman orthogonality.
[3] van der Laan, M. J., & Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer. TMLE and targeted estimation framework for causal parameters.
[4] Dudík, M., Langford, J., & Li, L. (2011). Doubly robust policy evaluation and learning. arXiv:1103.4601. Doubly robust methods for off-policy evaluation.
[5] Blackwell, D. (1953). Equivalent comparisons of experiments. Annals of Mathematical Statistics, 24(2), 265–272. Foundational work on information ordering and sufficiency.
[6] Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine, 8(4), 431–440. Original formulation of surrogate endpoint criteria in biostatistics.
[7] Pearl, J., & Bareinboim, E. (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29(4), 579–595. Formal causal framework for transportability and external validity.