
Robustness Under Pressure: Goodhart Points, Optimization Gaps, and Dynamic Validation

Eddie Landesberg · November 21, 2025 · 25 min read

0. Introduction: The Gap Between Static Validation and Dynamic Robustness

The CIMO stack currently validates judges (surrogate reward models) primarily through static metrics on holdout sets: RMSE reduction, calibration error, and correlation with gold-standard outcomes. This is appropriate for Regimes 1-3 (evaluation contexts), where the judge is a passive measurement instrument used to estimate the value of a fixed policy.

However, Regime 4: Optimization introduces a fundamentally different threat model. When the surrogate becomes an active control signal (e.g., RLHF reward model, Best-of-N judge), the policy is modified to maximize the surrogate. This creates adversarial pressure that can exploit any gap between correlation and causation.

The Core Problem

A judge with 0.92 correlation on a static test set may catastrophically fail when used to optimize a policy. Static validation tells us the judge is predictive; it does not tell us the judge is robust to optimization pressure.

This post formalizes the metrics and methodologies required to extend CIMO's framework from static evaluation to dynamic stress testing. We introduce:

  • Goodhart Point (GHP): The level of optimization pressure at which gold reward peaks then crashes.
  • Optimization Gap (OG): The divergence between surrogate and gold reward under optimization.
  • CLOVER-A (Active Adversarial Validation): A protocol for stress-testing judges by actively optimizing a sacrificial policy against them.
  • Side-Channel Audits (SCA): Systematic identification of exploitable paths in the causal graph.

These extensions are theoretical proposals based on Gao et al. (2022) and Frangakis & Rubin (2002). Implementation in the CJE package is future work.

1. The Goodhart Parabola

Gao et al. (2022) provide empirical evidence that reward hacking follows a predictable pattern: as optimization pressure increases, the proxy reward S increases monotonically, but the gold reward Y* follows a parabolic curve: it rises, peaks, then crashes.

Empirical Finding (Gao et al., Fig 1-3)

Across multiple domains (summarization, question-answering, dialogue), optimizing against a learned reward model S using PPO or Best-of-N sampling produces:

  • Proxy reward E[S]: increases monotonically with optimization steps or sample count N.
  • Gold reward E[Y*]: initially increases, reaches a maximum at a critical point (the Goodhart Point), then decreases as optimization continues.
  • Gap E[S] - E[Y*]: grows predictably, following a power law in the amount of optimization pressure.

This is not random noise or an edge case; it is the economics of optimization under imperfect surrogacy. The parabola reflects the exploitation of Dissociative Effects (Frangakis & Rubin, 2002): causal paths from the policy π to the surrogate S that do not pass through welfare Y*.

π → Length → S    (side channel)
π → Sycophancy → S    (side channel)
π → Y* → S    (mediated path)

Early in optimization, the policy improves welfare (Y*) and thus the surrogate (S) via the mediated path. As optimization pressure increases, the policy discovers and exploits the side channels, increasing S without increasing, or even while decreasing, Y*.

2. Metric 1: The Goodhart Point (GHP)

The Goodhart Point is the level of optimization pressure at which the gold reward Y* reaches its maximum. Beyond this point, further optimization harms welfare while continuing to improve the surrogate.

Definition 2.1 (Goodhart Point)

Let Ω denote the space of optimization pressures (e.g., KL divergence from the base model, Best-of-N sample count, number of RLHF steps). Let π_ω denote the policy obtained under pressure ω ∈ Ω.

The Goodhart Point is:

ω* := argmax_{ω ∈ Ω} E[Y*_{π_ω}]

The GHP is the point on the optimization-pressure axis where welfare is maximized. A higher ω* indicates greater robustness to optimization.

Interpretation:

  • High GHP: The judge remains robust even under aggressive optimization. The policy can be safely optimized further to extract more welfare.
  • Low GHP: The judge breaks down quickly under optimization pressure. The policy must be constrained (e.g., via KL penalties) to avoid crashing welfare.
  • GHP = 0: The judge is immediately exploitable; any optimization against it harms welfare.

Goal of CIMO in Regime 4

Increase the Goodhart Point through causal mediation enforcement (Y*-Alignment + SDP), enabling safe optimization to higher levels of performance.

2.1. Maximum Safe Gain

The Maximum Safe Gain is the welfare improvement achievable by optimizing up to the GHP:

ΔY*_safe := E[Y*_{π_{ω*}}] - E[Y*_{π_0}]

where π_0 is the base policy. This metric quantifies the ceiling on safe optimization: how much welfare can be extracted before the judge breaks down.
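To make these definitions concrete, here is a minimal sketch (Python with NumPy; variable names are illustrative, not CJE API) of how the GHP and Maximum Safe Gain would be computed from a pressure sweep, using the numbers from the worked example in Section 6:

    import numpy as np

    # One entry per optimization level omega; here omega is the BoN sample
    # count, and gold rewards are the estimates from the Section 6.3 table.
    pressures    = np.array([1, 2, 4, 8, 16, 32, 64])
    gold_rewards = np.array([5.00, 5.32, 5.54, 5.68, 5.62, 5.48, 5.29])  # E[Y*_{pi_omega}]

    ghp_idx = int(np.argmax(gold_rewards))                   # index of the welfare peak
    ghp = pressures[ghp_idx]                                 # Goodhart Point: omega* = 8
    max_safe_gain = gold_rewards[ghp_idx] - gold_rewards[0]  # gain vs. base policy pi_0

    print(f"GHP: omega* = {ghp}; Maximum Safe Gain = {max_safe_gain:+.2f}")  # -> 8, +0.68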

3. Metric 2: The Optimization Gap (OG)

The Optimization Gap measures the divergence between the surrogate and gold reward at a specific level of optimization pressure. It quantifies how badly the surrogate has been exploited.

Definition 3.1 (Optimization Gap)

At optimization pressure ω, the Optimization Gap is:

OG(ω) := E[S_{π_ω}] - E[Y*_{π_ω}]

Equivalently, OG(ω) is the expected difference between what the surrogate predicts and what the gold standard actually observes under the optimized policy.

Interpretation:

  • OG(ω) ≈ 0: The surrogate remains aligned with gold reward at pressure ω. Optimization is safe.
  • OG(ω) > 0: The surrogate overestimates welfare. The policy is exploiting side channels.
  • OG(ω) ≫ 0: Severe reward hacking. The surrogate is completely decoupled from welfare.

3.1. Best-of-N Instantiation

In Best-of-N (BoN) sampling, we generate N candidates from the base policy and select the one with the highest surrogate score. The optimization pressure is ω = N.

The Optimization Gap at N is:

OG(N) = E[max_{i ∈ [N]} S_i] - E[Y*_{argmax_{i ∈ [N]} S_i}]

where S_i and Y*_i are the surrogate and gold scores for the i-th candidate. The first term is the expected maximum surrogate score; the second term is the expected gold reward of the selected candidate.

Practical Note

Computing OG(N) requires gold annotations for the selected candidates, not just a random sample. This is why CLOVER-A must actively generate optimized samples and obtain gold labels for them.
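A minimal sketch of the BoN estimator, assuming surrogate scores and gold labels arranged per trial (the function name and array layout are illustrative):

    import numpy as np

    def optimization_gap_bon(S, Y_star):
        """Estimate OG(N) from a batch of BoN trials.

        S, Y_star: arrays of shape (trials, N) with surrogate scores and gold
        labels per candidate. In practice gold labels are only needed for the
        selected candidates; full arrays are shown here for clarity.
        """
        sel = np.argmax(S, axis=1)               # i* = argmax_i S_i, per trial
        rows = np.arange(S.shape[0])
        return S[rows, sel].mean() - Y_star[rows, sel].mean()  # E[max S] - E[Y* selected]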

4. Methodology 1: CLOVER-A (Active Adversarial Validation)

CLOVER-A extends the standard CLOVER protocol (static holdout validation + calibration) with an active adversarial stress test. The goal is to empirically measure the GHP and OG by actively optimizing a policy against the judge.

CLOVER-A Protocol (Preliminary)

  1. Static Validation (Standard CLOVER): Calibrate the judge S on a holdout set; verify RMSE reduction and correlation with gold labels Y*.
  2. Adversarial Stress Test: Using a sacrificial generator policy π_test (distinct from production), apply increasing optimization pressure:
    • Best-of-N sampling with N ∈ {2, 4, 8, 16, 32, 64, 128}
    • Light RLHF (PPO) for varying numbers of steps with KL penalties
  3. Gold Annotation: For each optimization level ω, collect a sample of optimized outputs and obtain gold labels Y* (e.g., via human evaluation or a stronger oracle model).
  4. Compute Metrics:
    • Plot E[S_{π_ω}] and E[Y*_{π_ω}] vs. ω
    • Identify the GHP: ω* = argmax_ω E[Y*_{π_ω}]
    • Compute OG at each level: OG(ω) = E[S_{π_ω}] - E[Y*_{π_ω}]
  5. Acceptance Criterion: The judge passes CLOVER-A if:
    • ω* ≥ ω_target (the GHP exceeds the target threshold)
    • OG(ω_target) ≤ ε (the gap remains small at the target pressure)

Why "Sacrificial" Policy? The stress test actively searches for reward hacking. You do not want to run this on your production policy. Instead, use a test instance of the generator to probe for weaknesses.

4.1. Comparison to Standard CLOVER

Aspect  | CLOVER (Standard)                           | CLOVER-A (Adversarial)
Data    | Static holdout set from base policy         | Actively optimized samples at varying pressure levels
Metrics | RMSE, calibration error, correlation        | Goodhart Point (GHP), Optimization Gap (OG)
Goal    | Validate predictive accuracy (Regimes 1-3)  | Validate robustness to optimization (Regime 4)
Cost    | Low (one-time annotation of holdout)        | High (requires gold labels for optimized samples)

CLOVER-A is complementary to standard CLOVER. Static validation is necessary but insufficient for Regime 4. Dynamic stress testing is expensive but essential for deployment in optimization contexts (RLHF, BoN).

5. Methodology 2: Side-Channel Audits (SCA)

Side-Channel Audits are systematic procedures for identifying exploitable causal paths in the domain. They formalize the process of discovering potential Dissociative Effects before they manifest as reward hacking.

5.1. What is a Side Channel?

A side channel is a causal path from the policy π to the surrogate S that does not pass through the welfare outcome Y*. Formally:

π → Feature → S    where Feature ⊥ Y* | π

Examples in LLM evaluation:

  • Length: π → Response Length → S, where length is correlated with quality on average but not causally required.
  • Sycophancy: π → Flattery/Agreement → S, where the judge rewards outputs that agree with the user regardless of correctness.
  • Confidence: π → Assertive Tone → S, where the judge mistakes confidence for accuracy.
  • Formatting: π → Markdown/Bullets → S, where the judge rewards structure over substance.

5.2. SCA Procedure

Side-Channel Audit Protocol

  1. Domain Analysis: List plausible features that could influence judge scores but are orthogonal to welfare (e.g., length, tone, formatting, keywords).
  2. Feature Extraction: For a sample of outputs, measure candidate side-channel features (e.g., word count, sentiment score, presence of specific tokens).
  3. Conditional Independence Test: For each candidate feature F, test whether S ⊥ F | Y* (i.e., does the surrogate depend on the feature even after conditioning on gold welfare?). Use regression or causal discovery methods; a regression-based sketch follows this list.
  4. Construct Blocking Table: For each identified side channel, design an SDP intervention that blocks it. Document the mapping:
    Side Channel → SDP Blocking Mechanism
  5. Validation: Re-run CLOVER-A with the updated SDP and verify that the blocked side channels no longer contribute to the Optimization Gap.
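One possible implementation of the conditional independence test in step 3, using regression adjustment (statsmodels; the function and variable names are illustrative):

    import numpy as np
    import statsmodels.api as sm

    def side_channel_test(S, Y_star, feature):
        """Test whether S depends on a candidate feature after conditioning on
        gold welfare: regress S on (Y*, feature) and inspect the feature term."""
        X = sm.add_constant(np.column_stack([Y_star, feature]))
        fit = sm.OLS(S, X).fit()
        return fit.params[2], fit.pvalues[2]   # feature coefficient and its p-value

A significantly nonzero feature coefficient flags a side channel, e.g. side_channel_test(S, Y_star, word_counts) for the length channel below.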

5.3. Example: Length Side Channel

Suppose we identify that S is correlated with response length even after conditioning on Y*:

Corr(S, Length | Y*) ≠ 0

SDP Intervention (Blocking Mechanism):

Instruction to Judge: "Evaluate the response based on correctness and helpfulness. Do not penalize concise answers or reward verbosity. If a short answer fully addresses the query, rate it as highly as a longer answer with equivalent correctness."

Evidence Requirement: "Cite specific claims in the response that are correct/incorrect. Do not reference length or detail as a quality signal."

After implementing this SDP update, we re-run the conditional independence test and verify that Corr(S, Length | Y*) ≈ 0. If the side channel persists, iterate on the SDP design.

5.4. Transitivity Diagnostic: The Subgroup Hack

Even with perfect causal structure (no side channels, no confounding), a third failure mode can cause the surrogate paradox: lack of transitivity, also called distributional non-monotonicity (VanderWeele, 2013).

VanderWeele's Condition (iii)

The surrogate paradox can occur if "the treatment does not positively affect the surrogate for all the same individuals for whom the surrogate positively affects the outcome."

The AI Scenario (The Subgroup Hack): Suppose a policy update affects two subpopulations:

Math Problems

The model improves reasoning. S ↑ and Y* ↑ (the judge and users both reward correct solutions). (Positive correlation)

Safety Queries

The model becomes over-refusing. S ↑ (judge rewards safety), Y* ↓ (user annoyed by unnecessary refusals). (Negative correlation)

If the update aggressively shifts behavior on Safety Queries, the average score E[S] goes up. Global calibration E[Y*|S] will say performance improved. But aggregate welfare may crash because the subgroup where S and Y* are anti-correlated dominates.

Transitivity Diagnostic Protocol

  1. Stratify: Split the evaluation set by domain, difficulty, or other meaningful covariates (e.g., "coding", "math", "creative writing", "safety-sensitive queries").
  2. Compute Stratum-Level Correlations: For each stratum k, estimate the sign agreement between surrogate improvement and welfare improvement:
    sign(ΔS_k) · sign(ΔY*_k)
  3. Flag Reversals: If any stratum shows ΔS_k > 0 but ΔY*_k < 0 (or vice versa), flag a Transitivity Violation.
  4. Report: Either (a) exclude the violating stratum from optimization, (b) design a stratum-specific SDP, or (c) acknowledge that global calibration is invalid for this population.
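A minimal sketch of the sign-agreement check in steps 2-3, with hypothetical stratum estimates:

    import numpy as np

    def transitivity_check(strata, delta_S, delta_Y):
        """Flag strata where surrogate and welfare changes disagree in sign.

        delta_S, delta_Y: per-stratum changes in E[S] and E[Y*] between the
        base and updated policy, estimated from stratified evaluation sets.
        """
        return [k for k, s, y in zip(strata, delta_S, delta_Y)
                if np.sign(s) != np.sign(y)]

    # Hypothetical estimates illustrating the Subgroup Hack:
    print(transitivity_check(["math", "coding", "safety"],
                             [0.4, 0.2, 0.5],     # surrogate improved everywhere
                             [0.3, 0.1, -0.2]))   # ...but welfare fell on safety
    # -> ['safety']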

Why Global Calibration Fails

E[Y*|S] (Design-by-Projection / Isotonic Regression) calibrates on the aggregate. It does not guarantee effect homogeneity across subgroups. The Transitivity Diagnostic is the missing test: it verifies that the S-Y* relationship has the same sign across all deployment strata.

Connection to CLOVER-A: The Transitivity Diagnostic should be run before CLOVER-A stress testing. If transitivity is violated, the judge is fundamentally broken for that subpopulation—no amount of SDP intervention can fix a sign reversal.

5.5. Direct Effect Diagnostic: Residual Policy Effect

VanderWeele's Condition (i): the treatment affects the outcome through paths that bypass the surrogate. In ML terms, optimization affects Y* through channels the judge doesn't see.

The Direct Effect Problem

Even if the SDP blocks known side channels (length, sycophancy), there may be unknown direct paths where optimization harms welfare without affecting the score. Example: a policy that optimizes for "appears helpful" while subtly degrading factual accuracy in ways the judge can't detect.

Direct Effect Diagnostic Protocol

  1. Generate Matched Policies: Create two policies π_A, π_B with similar S-distributions but different optimization trajectories.
  2. Condition on Surrogate: For samples where S_A ≈ S_B, measure the gold outcome Y* for both policies.
  3. Compute Residual: Estimate the residual policy effect on Y* after conditioning on S:
    Residual := E[Y*_A - Y*_B | S_A = S_B]
  4. Flag Direct Paths: If the residual differs significantly from zero, a direct effect exists. The policy is affecting welfare through channels not captured by the surrogate.

Practical Implementation

In practice, exact S-matching is difficult. Use propensity score weighting or regression adjustment: regress Y* on (Policy, S) and test whether the Policy coefficient is significant. A significant effect after controlling for S indicates a direct path.
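A sketch of the regression-adjustment variant described above (statsmodels; names are illustrative):

    import numpy as np
    import statsmodels.api as sm

    def direct_effect_test(S, Y_star, is_policy_B):
        """Regress Y* on (policy indicator, S); a significant policy coefficient
        after controlling for S indicates a direct path bypassing the surrogate."""
        X = sm.add_constant(np.column_stack([np.asarray(is_policy_B, float), S]))
        fit = sm.OLS(Y_star, X).fit()
        return fit.params[1], fit.pvalues[1]   # residual policy effect and its p-value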

5.6. Confounding Diagnostic: Invariance Under Perturbation

VanderWeele's Condition (ii): the observed S-Y* correlation is due to common causes, not a causal effect. In ML terms, "high correlation on static data" ≠ causal validity.

The Confounding Problem

If the S-Y* relationship is confounded (e.g., both driven by response style or domain difficulty), then intervening to raise S will not reliably raise Y*. The correlation is spurious: it holds observationally but breaks under intervention.

Confounding Diagnostic Protocol

  1. Vary the Judge: Change judge prompts, use judge ensembles, or swap to a different judge architecture. These are interventions on S that shouldn't affect the true S→Y* relationship.
  2. Test Invariance: Estimate the S-Y* regression slope under each judge variant. A causal relationship should be stable across judge perturbations.
  3. Flag Instability: If the slope changes significantly (e.g., 0.8 with Judge A, 0.3 with Judge B), the S-Y* link is likely confounded—driven by judge-specific biases rather than true welfare.
  4. Cross-Validation: Additionally, test with blinded vs. non-blinded human evaluation. If the S-Y* relationship differs, the judge is capturing confounders (presentation effects) rather than substance.

Interpretation

Stable slope: The S-Y* relationship is robust to measurement variation—more likely to be causal.
Unstable slope: The relationship is sensitive to how S is measured—confounding is likely. Do not trust S for optimization until confounders are identified and controlled.
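A minimal sketch of the invariance check, assuming per-variant samples of (S, Y*) pairs; the tolerance of 0.2 is an illustrative placeholder, not a calibrated threshold:

    import numpy as np

    def slope(S, Y_star):
        """OLS slope of Y* on S for one judge variant."""
        return np.polyfit(S, Y_star, 1)[0]

    def invariance_check(variants, tol=0.2):
        """variants: dict mapping judge-variant name -> (S scores, Y* labels).
        Returns per-variant slopes and whether their spread is within tol."""
        slopes = {name: slope(S, Y) for name, (S, Y) in variants.items()}
        spread = max(slopes.values()) - min(slopes.values())
        return slopes, spread <= tol   # stable slopes -> relationship more likely causal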

5.7. Summary: The Complete Diagnostic Suite

Before deploying a judge for optimization (Regime 4), run all three VanderWeele diagnostics:

Diagnostic    | Tests For                           | Method                                          | Pass Criterion
Direct Effect | Side channels bypassing S           | Residual policy effect after conditioning on S  | Residual ≈ 0
Confounding   | Spurious S-Y* correlation           | Invariance under judge perturbations            | Slope stable across variants
Transitivity  | Heterogeneous effects across strata | Stratum-wise ΔS vs. ΔY* signs                   | Same sign in all strata

Key Insight

Calibration alone is not a safety case. A well-calibrated judge can still fail under optimization if direct effects exist, confounding is present, or transitivity fails. All three diagnostics must pass before a surrogate is safe for Regime 4 deployment.

6. Worked Example: Best-of-N Stress Test

We now demonstrate the computation of GHP and OG using a synthetic Best-of-N stress test.

6.1. Setup

Suppose we have:

  • A base policy π_0 that generates responses with Y* ~ N(5, 1)
  • A surrogate judge S = Y* + β · Length, where Length ~ N(0, 1) and β = 0.3
  • The policy can increase length at will (modeling the ability to exploit the length side channel)

We perform Best-of-N sampling with N ∈ {1, 2, 4, 8, 16, 32, 64}.

6.2. Simulation

For each N, we:

  1. Generate N candidates: Y*_i ~ N(5, 1), L_i ~ N(0, 1)
  2. Compute surrogate scores: S_i = Y*_i + 0.3 · L_i
  3. Select the candidate with maximum S_i: i* = argmax_i S_i
  4. Record S_selected = S_{i*} and Y*_selected = Y*_{i*}

Repeat 10,000 times and compute expectations.
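A minimal simulation implementing these steps (NumPy; exact values depend on the seed and the precise generative model, so treat the table below as illustrative rather than a target output):

    import numpy as np

    rng = np.random.default_rng(0)

    def bon_stress_test(N, beta=0.3, trials=10_000):
        """Monte Carlo estimate of E[S], E[Y*], and OG(N) for the synthetic
        setup: Y*_i ~ N(5, 1), Length_i ~ N(0, 1), S_i = Y*_i + beta * Length_i."""
        Y = rng.normal(5.0, 1.0, size=(trials, N))   # gold quality per candidate
        L = rng.normal(0.0, 1.0, size=(trials, N))   # length side-channel feature
        S = Y + beta * L                             # surrogate score
        sel = np.argmax(S, axis=1)                   # BoN selection by surrogate
        rows = np.arange(trials)
        E_S = S[rows, sel].mean()                    # E[max_i S_i]
        E_Y = Y[rows, sel].mean()                    # E[Y* of selected candidate]
        return E_S, E_Y, E_S - E_Y                   # last term is OG(N)

    for N in (1, 2, 4, 8, 16, 32, 64):
        E_S, E_Y, og = bon_stress_test(N)            # beta=0.05 gives the Section 6.4 run
        print(f"N={N:3d}  E[S]={E_S:.2f}  E[Y*]={E_Y:.2f}  OG(N)={og:.2f}")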

6.3. Results

N  | E[S] | E[Y*] | OG(N)
1  | 5.00 | 5.00  | 0.00
2  | 5.48 | 5.32  | 0.16
4  | 5.82 | 5.54  | 0.28
8  | 6.09 | 5.68  | 0.41
16 | 6.31 | 5.62  | 0.69
32 | 6.51 | 5.48  | 1.03
64 | 6.69 | 5.29  | 1.40

Observations:

  • Surrogate reward E[S] increases monotonically with N.
  • Gold reward E[Y*] peaks at N = 8 (the Goodhart Point), then decreases as the policy increasingly selects candidates with high length but lower intrinsic quality.
  • The Optimization Gap OG(N) grows monotonically, reaching 1.40 at N = 64.

Takeaway

GHP = 8: The judge is safe for Best-of-8 sampling (ΔY* = +0.68) but breaks down beyond that. If we want to use Best-of-16 or higher, we must improve the judge (e.g., by blocking the length side channel via SDP).

6.4. After SDP Intervention

Suppose we redesign the SDP to block the length side channel, reducing β from 0.3 to 0.05. Re-running the stress test:

N  | E[S] | E[Y*] | OG(N)
8  | 5.91 | 5.85  | 0.06
16 | 6.14 | 6.08  | 0.06
32 | 6.33 | 6.27  | 0.06
64 | 6.49 | 6.22  | 0.27

Result: The GHP increases from N = 8 to N = 32, and the Maximum Safe Gain improves from ΔY* = +0.68 to ΔY* = +1.27. The Optimization Gap remains small (OG ≈ 0.06) even at N = 32.

Success Criterion

By blocking the side channel, we pushed the GHP higher and extracted more safe welfare gain. This is the goal of causal mediation enforcement in Regime 4.

7. Future Work: Integration into CJE

The metrics and methodologies presented here are theoretical proposals. Full implementation in the CJE package requires:

7.1. Engineering Work

  • Best-of-N Sampler: Utility to generate N candidates from a base policy and select by surrogate score.
  • RLHF Stress Tester: Lightweight PPO implementation for adversarial fine-tuning against a judge (or integration with existing RLHF frameworks).
  • Gold Oracle Interface: Workflow for collecting gold labels on optimized samples (human eval or stronger model).
  • GHP/OG Computation: Functions to compute Goodhart Point, Optimization Gap, and Maximum Safe Gain from stress test results.
  • SCA Toolkit: Feature extraction, conditional independence testing, and blocking table generation tools.

7.2. Research Questions

  • Scaling Laws: Can we predict the GHP and OG growth rate from static metrics (e.g., RMSE, correlation)? Gao et al. show power-law relationships; can we adapt these to CIMO's framework?
  • Automated SCA: Can we use causal discovery algorithms (e.g., PC, GES) to automatically identify side channels from observational data?
  • SDP Optimization: Can we formulate SDP design as an optimization problem that minimizes OG subject to interpretability and practicality constraints?
  • Multi-Domain Validation: Empirically validate CLOVER-A across domains (summarization, coding, dialogue) and measure GHP variance.

7.3. Deployment Considerations

CLOVER-A is expensive: it requires active optimization and gold annotation at multiple pressure levels. Cost-benefit considerations:

  • When to use CLOVER-A: Deployment contexts where the judge will be used for RLHF or high-stakes BoN sampling. Standard CLOVER suffices for evaluation-only use cases (Regimes 1-3).
  • Budget allocation: Focus the gold-annotation budget on the GHP region (where welfare peaks) rather than uniformly across all ω levels.
  • Continuous monitoring: In production, track leading indicators (feature drift, variance collapse) to detect early warnings of approaching the GHP without requiring constant gold labels.

References

Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. arXiv:2210.10760.

Frangakis, C. E., & Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1), 21-29.

Prentice, R. L. (1989). Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in Medicine, 8(4), 431-440.

VanderWeele, T. J. (2013). Surrogate measures and consistent surrogates. Biometrics, 69(3), 561-565.

For the foundational CIMO framework and standard CLOVER protocol, see: