CIMO Labs

Research Note

Post-Audit Drift Correction: Offset vs EIF Correction vs Refit

Eddie Landesberg · 12 min read

Transport audits tell you when calibration has drifted. This note studies the harder follow-up question: once drift is detected, what correction should be the default?

Abstract

We compare post-audit correction strategies for calibrated LLM judges under controlled transport drift. Methods include stale plug-in, global offset, policy-specific residual correction (an AIPW/EIF-style first-moment update), and recent/pooled refits (monotone and two-stage). Global offset improves on stale plug-in in every tested drift regime, but it is not a robust default under structural drift. Policy-specific correction is a stronger default for first-moment drift. Refit remains necessary when the residual structure is score- or covariate-conditional.

1. Problem

Let f_old(S) be a calibrator learned in a prior era and let the target estimand for policy p be theta_p = E[Y | p]. After an audit in the new era, we need to update estimates without blindly refitting on every cycle.

The key decomposition is:

theta_p = E_p[f_old(S)] + E_p[Y - f_old(S)]

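This identity holds for any f_old by linearity of expectation, so it does not require the old calibrator to still be valid. It separates the old plug-in term from the residual correction term. The correction question becomes: how rich does that residual update need to be?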
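As a quick numerical check of the decomposition, here is a minimal sketch on simulated data. The linear form of f_old, the drifted label model, and the sample size are all illustrative assumptions and are not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_old(s):
    # Hypothetical stale calibrator: a fixed linear map learned in the old era.
    return 0.2 + 0.6 * s

# Simulated new-era data for a single policy p (illustrative only).
s = rng.uniform(0.0, 1.0, size=10_000)               # judge scores S
y = 0.05 + 0.8 * s + rng.normal(0.0, 0.05, s.size)   # oracle labels Y after drift

plug_in = f_old(s).mean()          # E_p[f_old(S)]
residual = (y - f_old(s)).mean()   # E_p[Y - f_old(S)]
theta_p = y.mean()                 # E[Y | p]

# The identity holds exactly in-sample, by linearity of the mean.
assert np.isclose(theta_p, plug_in + residual)
print(f"plug-in={plug_in:.3f} residual={residual:.3f} theta_p={theta_p:.3f}")
```

In practice the residual term is only observable on the audited subset, which is why the choice of residual estimator matters.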

2. Methods Compared

We evaluate seven methods from the experiments workspace:

  1. old_plugin
  2. old_plus_global_offset
  3. old_plus_policy_offset (policy-specific residual correction)
  4. recent_refit_monotone
  5. pooled_refit_monotone
  6. recent_refit_two_stage
  7. pooled_refit_two_stage

In practice, the main operational tradeoff is between global offset (2), policy-specific correction (3), and the recent refits (4 and 6).

3. EIF-Style Correction Intuition

Global offset applies one shared correction to all policies:

theta_hat_p = E_p[f_old(S)] + delta_hat

Policy-specific correction estimates the residual first moment per policy:

theta_hat_p = E_p[f_old(S)] + E_audit,p[Y - f_old(S)]

This is an AIPW/EIF-style one-step correction for policy means. It captures policy-level first-moment drift that a single global offset cannot represent, while avoiding immediate full refit when drift is not strongly structural.
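The two estimators above can be sketched side by side. This is an illustration rather than the workspace implementation: the function names, the dict-of-arrays layout, and the choice to estimate delta_hat as the pooled mean of all audit residuals are assumptions.

```python
import numpy as np

def global_offset_estimates(eval_scores, audit_scores, audit_labels, f_old):
    """theta_hat_p = E_p[f_old(S)] + delta_hat, with one delta_hat for all policies."""
    # Assumption: delta_hat is the mean residual pooled over every audited sample.
    pooled_resid = np.concatenate(
        [audit_labels[p] - f_old(audit_scores[p]) for p in audit_scores]
    )
    delta_hat = pooled_resid.mean()
    return {p: f_old(s).mean() + delta_hat for p, s in eval_scores.items()}

def policy_offset_estimates(eval_scores, audit_scores, audit_labels, f_old):
    """theta_hat_p = E_p[f_old(S)] + E_audit,p[Y - f_old(S)]."""
    return {
        p: f_old(s).mean() + (audit_labels[p] - f_old(audit_scores[p])).mean()
        for p, s in eval_scores.items()
    }

# Toy data: two policies whose residuals drift in opposite directions.
f_old = lambda s: s  # hypothetical identity calibrator
eval_scores = {"a": np.array([0.2, 0.4, 0.6]), "b": np.array([0.5, 0.7, 0.9])}
audit_scores = {"a": np.array([0.2, 0.6]), "b": np.array([0.5, 0.9])}
audit_labels = {"a": audit_scores["a"] + 0.1, "b": audit_scores["b"] - 0.1}

global_est = global_offset_estimates(eval_scores, audit_scores, audit_labels, f_old)
policy_est = policy_offset_estimates(eval_scores, audit_scores, audit_labels, f_old)
print(global_est, policy_est)
```

In this toy case the opposite-signed residuals cancel in the pooled offset, so the global estimator leaves both policies uncorrected while the policy-specific estimator shifts each one by its own residual mean.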

4. Experiment Design

We simulate two time periods (old calibration era, new evaluation era), four policy families, and controlled drift in the judge-to-oracle relationship.

Drift scenarios:

  • intercept_shift
  • slope_shift
  • nonlinear_shift
  • covariate_interaction_shift

Audit profiles:

  • base_heavy (non-representative)
  • balanced (more representative)

Main metrics are policy-mean MAE/RMSE, ranking accuracy, and transport status from audit diagnostics.
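For concreteness, the policy-mean error metrics can be sketched as below. The note does not spell out how ranking accuracy is defined, so the pairwise-concordance definition here is an assumption, as are the function names and the example values.

```python
import numpy as np
from itertools import combinations

def policy_mean_errors(theta_hat, theta_true):
    """MAE and RMSE of estimated policy means against the true policy means."""
    policies = sorted(theta_true)
    err = np.array([theta_hat[p] - theta_true[p] for p in policies])
    return {
        "mae": float(np.abs(err).mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
    }

def pairwise_ranking_accuracy(theta_hat, theta_true):
    """Fraction of policy pairs ordered the same way by estimate and truth.

    Assumed definition; ties count as incorrect.
    """
    pairs = list(combinations(sorted(theta_true), 2))
    correct = sum(
        (theta_hat[a] - theta_hat[b]) * (theta_true[a] - theta_true[b]) > 0
        for a, b in pairs
    )
    return correct / len(pairs)

# Illustrative values for four policy families (not experimental results).
theta_true = {"p1": 0.50, "p2": 0.60, "p3": 0.70, "p4": 0.80}
theta_hat = {"p1": 0.52, "p2": 0.58, "p3": 0.75, "p4": 0.73}

errors = policy_mean_errors(theta_hat, theta_true)
rank_acc = pairwise_ranking_accuracy(theta_hat, theta_true)
print(errors, rank_acc)
```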

5. Empirical Pattern

  1. Global offset improves over stale plug-in across tested drift regimes.
  2. Under slope and nonlinear drift, global offset leaves substantial residual error.
  3. Policy-specific correction closes much of that gap and often approaches monotone refit.
  4. Under covariate-interaction drift, two-stage refit remains best.

Figure: Method comparison at max audit size across drift scenarios and audit profiles. At larger audit sizes, policy-specific correction and refits dominate global offset under structural drift.

Figure: MAE versus audit size comparing global offset, policy offset, and recent monotone refit. As audit size increases, policy-specific residual correction remains consistently stronger than a single global offset.

6. Recommended Post-Audit Protocol

  1. Default: bootstrap + policy-specific augmented correction.
  2. Baseline comparator: single global offset.
  3. Escalate to refit: when residual diagnostics show structural drift.
  4. Use two-stage refit: when drift is covariate-conditional.
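One way to read steps 1 and 3 of the protocol is sketched below. It assumes that "bootstrap" means a percentile bootstrap interval on the per-policy residual mean, and it uses the slope of residuals against judge score as a hypothetical stand-in for the residual diagnostics. Both helper functions and any escalation threshold applied to the slope are illustrative and are not part of the cje package.

```python
import numpy as np

def bootstrap_policy_correction(scores, labels, f_old, n_boot=2000, seed=0):
    """Residual-mean correction for one policy with a 95% percentile bootstrap interval."""
    rng = np.random.default_rng(seed)
    resid = labels - f_old(scores)
    # Resample audit rows with replacement; one row of idx per bootstrap replicate.
    idx = rng.integers(0, resid.size, size=(n_boot, resid.size))
    boot_means = resid[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return resid.mean(), (lo, hi)

def residual_slope(scores, labels, f_old):
    """Least-squares slope of residuals on judge score (0 under a pure intercept shift)."""
    resid = labels - f_old(scores)
    slope, _intercept = np.polyfit(scores, resid, deg=1)
    return slope

f_old = lambda s: 0.2 + 0.6 * s          # hypothetical stale calibrator
scores = np.linspace(0.0, 1.0, 50)

# Intercept shift: a constant residual, so a first-moment correction suffices.
y_intercept = f_old(scores) + 0.1
correction, interval = bootstrap_policy_correction(scores, y_intercept, f_old)
slope_intercept = residual_slope(scores, y_intercept, f_old)

# Slope shift: residuals vary with score, a signal to consider escalating to refit.
y_slope = f_old(scores) + 0.3 * (scores - 0.5)
slope_structural = residual_slope(scores, y_slope, f_old)
print(correction, interval, slope_intercept, slope_structural)
```

A residual slope near zero is consistent with staying on the policy-specific correction, while a clearly nonzero slope suggests score-conditional structure. A covariate-conditional check would need the covariates themselves and is not shown here.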

7. Reproducibility

All experiments live in the repo-level workspace, outside the PyPI runtime surface:

python experiments/offset_vs_refit/offset_vs_refit_simulation.py --n-reps 60 --audit-sizes 20,50,100,200

Source: github.com/cimo-labs/cje/experiments/offset_vs_refit