CIMO Labs

When CJE Works (and When It Doesn't)

CJE is not magic. It only produces valid estimates when certain assumptions hold. This page helps you verify those assumptions before you run the analysis.

The Hard Truth

If these assumptions fail badly, CJE will still produce numbers. They'll just be wrong. The diagnostics help detect violations, but some assumptions (like A0) cannot be tested statistically—they require human judgment.

Quick Checklist: Can I Use CJE?

Answer these questions before running CJE. If you answer "No" to any, see the detailed section below for remediation options.

A0

Does your oracle label (Y) measure what you actually care about?

Example: If your oracle is "human preference" but you care about "long-term user retention," A0 fails. Your SDP must capture your true welfare target.

S1

Does your surrogate (S) actually predict the oracle (Y)?

Example: If your LLM judge score has zero correlation with expert ratings, S1 fails. Check calibration curves before trusting estimates.

S2

Will your calibration hold across policies and time?

Example: If you calibrate on Policy A but evaluate Policy B, and the S→Y relationship changes between policies, S2 fails. Run transport tests.

S3

Does your logging data cover the target policy's behavior?

Example: If Policy B produces outputs that Policy A never generated, off-policy methods have no data to learn from. Check ESS and TTC coverage.

L1

Are your oracle labels sampled at random?

Example: If you only send "suspicious" samples to human review, your calibration is biased. Oracle labeling must be random or propensity-corrected.

L2

Do you have oracle labels across the full score range?

Example: If all your human labels are for "high confidence" samples (S > 0.8), you cannot calibrate the low end. Ensure coverage across the score space.

Detailed Assumption Guide

A0: The Bridge Assumption [FOUNDATIONAL]

Formal statement: Optimizing for improvements in Y produces improvements in Y* (true welfare) in expectation.

Failure mode

You optimize for a bureaucratic process, not welfare. Your evaluation system becomes a "programmable proxy" divorced from value. Example: Optimizing for "sounds confident" when you want "is correct."

How to validate

  • Predictive Treatment Effects (PTE): Do improvements in Y predict improvements in long-run outcomes?
  • Expert audit: Do domain experts agree Y captures what matters?
  • Construct validity: Does your SDP rubric cover all dimensions of welfare?

Note: A0 cannot be tested statistically. It requires human judgment about your measurement procedure. See Bridge Validation Protocol.

S1: Surrogate Validity [CALIBRATION]

Formal statement: The conditional expectation E[Y|S] captures the true S→Y relationship. No unmeasured confounders bias the mapping.

Failure mode

Biased policy estimates due to missing covariates. Example: Your judge gives high scores to long responses, but length isn't why users prefer them—it's the detail. If you don't condition on the right features, calibration is off.

Diagnostic

  • Calibration curve: Plot f(S) vs actual Y. Should be monotonic and well-fit.
  • Residual analysis: Residuals should not correlate with policy or context features.
  • CJE diagnostic: result.diagnostics.calibration_quality

If this fails: Add covariates to your surrogate, improve your judge rubric, or collect more diverse oracle labels.
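The calibration-curve check above can be sketched in a few lines. This is a minimal illustration assuming you have paired arrays of judge scores `s` and oracle labels `y`; the function and array names are illustrative, not the CJE API.

```python
# Binned calibration curve: mean oracle outcome per score quantile bin.
# For a valid surrogate (S1), the curve should rise roughly monotonically.
import numpy as np

def calibration_curve(s, y, n_bins=10):
    """Return bin centers and mean oracle outcome per score quantile bin."""
    edges = np.quantile(s, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(s, edges[1:-1]), 0, n_bins - 1)
    centers = np.array([s[idx == b].mean() for b in range(n_bins)])
    means = np.array([y[idx == b].mean() for b in range(n_bins)])
    return centers, means

rng = np.random.default_rng(0)
s = rng.uniform(0, 1, 2000)
y = (rng.uniform(0, 1, 2000) < s).astype(float)  # toy well-calibrated judge
centers, means = calibration_curve(s, y)
print(means)  # roughly increasing for a valid surrogate
```

A flat or non-monotonic curve is the visual signature of an S1 failure; correlating the residuals `y - means[idx]` with policy or context features catches the confounding case.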

S2: Transport [STABILITY]

Formal statement: The calibration function f(S) holds across environments and time. P(Y|S) is stable across the source and target distributions.

Failure mode

Evaluator drift or distribution shift invalidates calibration. Example: You calibrate in January, but the judge model was updated in February. Or you calibrate on English but evaluate multilingual.

Diagnostic

  • Transport test: Compare calibration curves across time periods or subgroups.
  • Drift monitoring: Track E[Y|S] stability over time.
  • CJE diagnostic: result.diagnostics.transport_warning

If this fails: Recalibrate more frequently, use Continuous Causal Calibration (CCC), or stratify by subgroup.
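A transport test of the kind listed above can be sketched as follows: estimate E[Y|S] in shared score bins on two data slices (say, January vs. February) and compare. All names here are illustrative, not the CJE API.

```python
# Compare the S->Y relationship across two slices using shared quantile bins.
import numpy as np

def binned_mean(s, y, edges):
    """Mean of y within each score bin defined by `edges`."""
    idx = np.clip(np.digitize(s, edges[1:-1]), 0, len(edges) - 2)
    return np.array([y[idx == b].mean() for b in range(len(edges) - 1)])

def transport_gap(s_a, y_a, s_b, y_b, n_bins=5):
    """Max absolute difference in E[Y|S] between the two slices."""
    edges = np.quantile(np.concatenate([s_a, s_b]), np.linspace(0, 1, n_bins + 1))
    return float(np.max(np.abs(binned_mean(s_a, y_a, edges)
                               - binned_mean(s_b, y_b, edges))))

rng = np.random.default_rng(1)
s_a = rng.uniform(0, 1, 3000); y_a = s_a + rng.normal(0, 0.05, 3000)
s_b = rng.uniform(0, 1, 3000); y_b = s_b + rng.normal(0, 0.05, 3000)
gap = transport_gap(s_a, y_a, s_b, y_b)
print(gap)  # small gap: the S->Y relationship transports between slices
```

A large gap on real data (relative to sampling noise) is evidence that P(Y|S) shifted and the calibration needs to be redone.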

S3: Overlap [OFF-POLICY]

Formal statement: Every output the target policy might generate had non-zero probability under the logging policy, so target-policy behavior is represented in the logged data.

Failure mode

Variance explosion from extreme importance weights. Example: Your target policy generates creative outputs that the conservative logging policy never produced. IPS weights blow up (1000x), making estimates unreliable.

Diagnostic

  • Effective Sample Size (ESS): ESS < 100 warrants caution; ESS < 50 is a serious red flag.
  • Target-Typicality Coverage (TTC): What fraction of logged samples fall in target-typical regions?
  • Weight distribution: Check for extreme weights (>10x mean).
  • CJE diagnostic: result.diagnostics.ess, result.diagnostics.max_weight

If this fails: Use DM (Direct Method) instead of IPS/DR, or use weight clipping (but understand the bias). See Coverage-Limited Efficiency.
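The ESS and extreme-weight diagnostics above reduce to simple formulas over the raw importance weights. This sketch is illustrative (CJE exposes the result as result.diagnostics.ess):

```python
# Effective sample size and extreme-weight check from importance weights.
import numpy as np

def ess(weights):
    """Effective sample size: (sum w)^2 / sum(w^2). Equals n for uniform w."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

rng = np.random.default_rng(0)
good = rng.lognormal(mean=0.0, sigma=0.3, size=1000)  # mild weight spread
bad = rng.lognormal(mean=0.0, sigma=3.0, size=1000)   # heavy-tailed weights
print(ess(good))  # close to 1000: overlap is fine
print(ess(bad))   # collapses: a few extreme weights dominate
print(bad.max() / bad.mean())  # extreme-weight check (>10x mean is a flag)
```

The heavy-tailed case shows why poor overlap is dangerous: the estimate is effectively driven by a handful of samples, so its nominal sample size is an illusion.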

L1: Oracle Missing at Random [LABELING]

Formal statement: Oracle labeling probability P(L=1|S,X) does not depend on unobserved factors that also affect Y. Missingness is ignorable given observed covariates.

Failure mode

Sample selection bias in oracle labels. Example: You send "hard" samples to human review, but "hardness" also predicts lower quality. Your calibration learns a biased relationship.

How to satisfy

  • Random sampling: Select oracle samples uniformly at random from the population.
  • Stratified sampling: If you must oversample some regions, record the sampling weights.
  • Propensity correction: If labeling was non-random, model P(L=1|S,X) and reweight.

Best practice: Always use random sampling for oracle labels. If you can't, document your sampling strategy and check for bias.
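The propensity-correction option above can be sketched with a toy example: when labeling was non-random but P(L=1|S) is known or modeled, reweighting labeled samples by 1/P(L=1|S) removes the selection bias. All names are illustrative, not the CJE API.

```python
# Inverse-propensity reweighting of non-randomly labeled oracle samples.
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(0, 1, 20000)              # judge scores
y = s + rng.normal(0, 0.1, 20000)         # oracle outcome; true E[Y] = 0.5
p_label = 0.1 + 0.8 * s                   # biased labeling: high-S oversampled
labeled = rng.uniform(0, 1, 20000) < p_label

naive = y[labeled].mean()                 # biased upward by the selection
w = 1.0 / p_label[labeled]                # inverse-propensity weights
ipw = np.sum(w * y[labeled]) / np.sum(w)  # Hajek-style reweighted mean
print(naive, ipw)  # naive overestimates E[Y]; ipw is close to 0.5
```

This only works when the labeling probability depends on observed quantities (here, S); if it depends on unobservables that also affect Y, L1 itself has failed and no reweighting fixes it.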

L2: Positivity (Label Coverage) [LABELING]

Formal statement: P(L=1|S,X) > 0 for all regions of the score space. Every score level has non-zero probability of receiving an oracle label.

Failure mode

Blind spots in score space. Example: You have oracle labels only for S > 0.5, so calibration for low-scoring samples is extrapolation, not interpolation.

Diagnostic

  • Coverage histogram: Plot oracle label distribution across S bins. Look for gaps.
  • Extrapolation warning: Flag any estimates outside the calibration support.

If this fails: Collect more oracle labels in underrepresented regions. If that's not possible, acknowledge uncertainty in those regions.

Which Estimator Should I Use?

Given your assumption status, here's which CJE estimator to choose:

Scenario                                   | Estimator                 | Why
Good overlap (ESS > 100), good calibration | DR (Doubly Robust)        | Best of both worlds: consistent if either the outcome model or the weights are correct.
Poor overlap (ESS < 100), good calibration | DM (Direct Method)        | Avoids weight explosion; relies on calibration quality.
Good overlap, uncertain calibration        | IPS (Importance Sampling) | Unbiased but high variance; model-free.
Poor overlap, uncertain calibration        | Don't use OPE             | Off-policy evaluation is unreliable here. Collect on-policy data.
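The decision table above can be restated as a small helper. The thresholds mirror the table; `choose_estimator` is an illustrative sketch, not part of the CJE API.

```python
# Estimator choice given overlap (ESS) and calibration status, per the table.
def choose_estimator(ess: float, calibration_ok: bool) -> str:
    good_overlap = ess > 100
    if good_overlap and calibration_ok:
        return "DR"    # doubly robust: consistent if either model is right
    if calibration_ok:
        return "DM"    # poor overlap: avoid weight explosion
    if good_overlap:
        return "IPS"   # unbiased but high variance, model-free
    return "collect on-policy data"  # OPE is unreliable here

print(choose_estimator(250, True))   # -> DR
print(choose_estimator(60, True))    # -> DM
print(choose_estimator(250, False))  # -> IPS
print(choose_estimator(60, False))   # -> collect on-policy data
```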