Research Note
Post-Audit Drift Correction: Offset vs EIF Correction vs Refit
Transport audits tell you when calibration has drifted. This note studies the harder follow-up question: once drift is detected, what correction should be the default?
Abstract
We compare post-audit correction strategies for calibrated LLM judges under controlled transport drift. The methods are the stale plug-in, a global offset, a policy-specific residual correction (an AIPW/EIF-style first-moment update), and recent and pooled refits (monotone and two-stage). The global offset improves on the stale plug-in in every tested drift regime, but it is not a robust default under structural drift. Policy-specific correction is a stronger default for first-moment drift. A refit remains necessary when the residual structure is score- or covariate-conditional.
1. Problem
Let f_old(S) be a calibrator learned in a prior era and let the target estimand for policy p be theta_p = E[Y | p]. After an audit in the new era, we need to update estimates without blindly refitting on every cycle.
The key decomposition is:
theta_p = E_p[f_old(S)] + E_p[Y - f_old(S)]
This identity separates the old plug-in term from the residual correction term. The correction question becomes: how rich does that residual update need to be?
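The decomposition above can be checked numerically. The following is a minimal sketch on synthetic data: the identity calibrator, the drifted label model, and the sample size are all illustrative assumptions, not values from the experiments.

```python
import random

random.seed(0)

def mean(xs):
    return sum(xs) / len(xs)

def f_old(s):
    # Hypothetical stale calibrator: identity map on a score in [0, 1].
    return s

# Hypothetical new-era sample for one policy. The oracle label Y has
# drifted away from f_old(S) (assumed slope and intercept change plus noise).
scores = [random.random() for _ in range(1000)]
labels = [0.1 + 0.7 * s + random.gauss(0, 0.05) for s in scores]

plug_in = mean([f_old(s) for s in scores])                       # E_p[f_old(S)]
residual = mean([y - f_old(s) for s, y in zip(scores, labels)])  # E_p[Y - f_old(S)]
theta = mean(labels)                                             # E_p[Y]

# The identity holds on any sample, up to floating-point error.
gap = abs(theta - (plug_in + residual))
print(f"plug_in={plug_in:.3f} residual={residual:.3f} theta={theta:.3f}")
```

In this sketch both terms are computed on the same sample, so the identity is exact. In an audit setting the residual term would typically be estimated on a much smaller labeled subset, which is where the choice of correction starts to matter.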
2. Methods Compared
We evaluate seven methods from the experiments workspace:
(1) old_plugin
(2) old_plus_global_offset
(3) old_plus_policy_offset (policy-specific residual correction)
(4) recent_refit_monotone
(5) pooled_refit_monotone
(6) recent_refit_two_stage
(7) pooled_refit_two_stage
In practice, the main operational tradeoff is between (2), (3), and recent refits.
3. EIF-Style Correction Intuition
Global offset applies one shared correction to all policies:
theta_hat_p = E_p[f_old(S)] + delta_hat
Policy-specific correction estimates the residual first moment per policy:
theta_hat_p = E_p[f_old(S)] + E_{audit,p}[Y - f_old(S)]
This is an AIPW/EIF-style one-step correction for policy means. It captures policy-level first-moment drift that a single global offset cannot represent, while avoiding an immediate full refit when drift is not strongly structural.
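To make the contrast between the two corrections concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the identity calibrator, the pure slope drift (oracle mean 0.8 * S), the two policies and their score ranges, and the choice to estimate delta_hat as the audit residual mean pooled over policies.

```python
import random

random.seed(1)

def mean(xs):
    return sum(xs) / len(xs)

def f_old(s):
    # Hypothetical stale calibrator (identity on [0, 1] scores).
    return s

def oracle_mean(s):
    # Hypothetical slope drift in the new era.
    return 0.8 * s

# Two hypothetical policies whose judge scores occupy different ranges.
score_ranges = {"low": (0.2, 0.6), "high": (0.6, 1.0)}

eval_scores, audit = {}, {}
for p, (lo, hi) in score_ranges.items():
    eval_scores[p] = [random.uniform(lo, hi) for _ in range(5000)]
    audit_s = [random.uniform(lo, hi) for _ in range(200)]
    audit[p] = [(s, oracle_mean(s) + random.gauss(0, 0.05)) for s in audit_s]

# Global offset: one residual mean pooled over all audited policies (assumed).
delta_hat = mean([y - f_old(s) for rows in audit.values() for s, y in rows])

errors = {}
for p in score_ranges:
    plug_in = mean([f_old(s) for s in eval_scores[p]])
    policy_resid = mean([y - f_old(s) for s, y in audit[p]])
    truth = mean([oracle_mean(s) for s in eval_scores[p]])
    errors[p] = {
        "old_plugin": abs(plug_in - truth),
        "global_offset": abs(plug_in + delta_hat - truth),
        "policy_offset": abs(plug_in + policy_resid - truth),
    }

for p, err in errors.items():
    print(p, {k: round(v, 3) for k, v in err.items()})
```

Under slope drift the mean residual depends on where a policy's scores sit, so the pooled offset over-corrects one policy and under-corrects the other, while the per-policy residual mean tracks each policy's own first moment.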
4. Experiment Design
We simulate two time periods (old calibration era, new evaluation era), four policy families, and controlled drift in the judge-to-oracle relationship.
Drift scenarios:
- intercept_shift
- slope_shift
- nonlinear_shift
- covariate_interaction_shift
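The functional forms used in the simulation are not given in this note. As a purely illustrative sketch, the four scenario names could correspond to transformations like the ones below; the coefficients, the binary covariate x, and the identity base relationship are all hypothetical.

```python
def base(s, x):
    # Hypothetical old-era relationship between judge score s and oracle mean.
    return s

# Hypothetical drift forms. Coefficients are illustrative, not from the experiments.
DRIFTS = {
    "intercept_shift": lambda s, x: base(s, x) - 0.10,
    "slope_shift": lambda s, x: 0.8 * base(s, x),
    "nonlinear_shift": lambda s, x: base(s, x) ** 2,
    "covariate_interaction_shift": lambda s, x: base(s, x) * (1.0 - 0.3 * x),
}

def residual_mean(drift, scores, xs):
    """Mean of (new-era oracle mean - old calibrator) over a sample."""
    vals = [DRIFTS[drift](s, x) - base(s, x) for s, x in zip(scores, xs)]
    return sum(vals) / len(vals)

low, high = [0.1, 0.2, 0.3], [0.7, 0.8, 0.9]
x0, x1 = [0, 0, 0], [1, 1, 1]

for name in DRIFTS:
    print(
        name,
        round(residual_mean(name, low, x0), 3),
        round(residual_mean(name, high, x0), 3),
        round(residual_mean(name, high, x1), 3),
    )
```

In this toy setup an intercept shift produces the same mean residual for low-score and high-score samples, slope and nonlinear shifts produce score-dependent residuals, and the interaction shift only shows up when the covariate is active.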
Audit profiles:
- base_heavy (non-representative)
- balanced (more representative)
Main metrics are policy-mean MAE/RMSE, ranking accuracy, and transport status from audit diagnostics.
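As a sketch of how the policy-level metrics might be computed: the note does not define ranking accuracy, so the pairwise-agreement definition below is an assumption, and the four policy values are made-up numbers for illustration.

```python
import math
from itertools import combinations

def policy_mean_errors(estimates, truths):
    """MAE and RMSE of estimated policy means against oracle policy means."""
    diffs = [estimates[p] - truths[p] for p in truths]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

def pairwise_ranking_accuracy(estimates, truths):
    """Fraction of policy pairs ordered the same way by estimate and truth.

    Ties in either ordering count as disagreements (assumed convention).
    """
    pairs = list(combinations(truths, 2))
    agree = sum(
        (estimates[a] - estimates[b]) * (truths[a] - truths[b]) > 0
        for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical oracle and estimated means for four policy families.
truths = {"A": 0.50, "B": 0.60, "C": 0.70, "D": 0.80}
estimates = {"A": 0.52, "B": 0.58, "C": 0.76, "D": 0.74}

mae, rmse = policy_mean_errors(estimates, truths)
rank_acc = pairwise_ranking_accuracy(estimates, truths)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} ranking_accuracy={rank_acc:.3f}")
```

Here the estimates swap the order of policies C and D, so five of the six pairs agree with the oracle ordering even though the mean errors are small.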
5. Empirical Pattern
- Global offset improves over stale plug-in across tested drift regimes.
- Under slope and nonlinear drift, global offset leaves substantial residual error.
- Policy-specific correction closes much of that gap and often approaches monotone refit.
- Under covariate-interaction drift, two-stage refit remains best.


6. Recommended Post-Audit Protocol
- Default: bootstrap + policy-specific augmented correction.
- Baseline comparator: single global offset.
- Escalate to refit: when residual diagnostics show structural drift.
- Use two-stage refit: when drift is covariate-conditional.
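The first three steps of the protocol above can be sketched for a single policy as follows. This is an illustration under stated assumptions, not the repository's implementation: the percentile bootstrap, the residual-on-score slope as a structural-drift probe, the 0.1 escalation threshold, and the synthetic audit data are all hypothetical choices.

```python
import random

random.seed(2)

def mean(xs):
    return sum(xs) / len(xs)

def f_old(s):
    # Hypothetical stale calibrator (identity on [0, 1] scores).
    return s

def bootstrap_offset_ci(residuals, n_boot=2000, alpha=0.05):
    """Percentile bootstrap interval for the mean audit residual."""
    n = len(residuals)
    means = sorted(mean(random.choices(residuals, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def residual_slope(scores, residuals):
    """OLS slope of residual on score, used here as a simple drift probe."""
    s_bar, r_bar = mean(scores), mean(residuals)
    cov = sum((s - s_bar) * (r - r_bar) for s, r in zip(scores, residuals))
    var = sum((s - s_bar) ** 2 for s in scores)
    return cov / var

# Hypothetical audit sample for one policy under slope drift.
audit_scores = [random.random() for _ in range(200)]
audit_labels = [0.8 * s + random.gauss(0, 0.05) for s in audit_scores]
residuals = [y - f_old(s) for s, y in zip(audit_scores, audit_labels)]

offset = mean(residuals)                      # policy-specific correction
ci_lo, ci_hi = bootstrap_offset_ci(residuals)
slope = residual_slope(audit_scores, residuals)

# Hypothetical escalation rule: the 0.1 threshold is illustrative only.
escalate_to_refit = abs(slope) > 0.1
print(f"offset={offset:.3f} CI=({ci_lo:.3f}, {ci_hi:.3f}) slope={slope:.3f}")
print("escalate to refit:", escalate_to_refit)
```

A residual slope near zero would suggest the drift is mostly a first-moment shift that the offset already absorbs. A clearly nonzero slope indicates score-conditional structure, which is the case where the protocol escalates to a refit.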
7. Reproducibility
All experiments are in the repo-level workspace (outside PyPI runtime surface):
python experiments/offset_vs_refit/offset_vs_refit_simulation.py --n-reps 60 --audit-sizes 20,50,100,200
Source: github.com/cimo-labs/cje/experiments/offset_vs_refit
