
Causal Judge Evaluation
Stop guessing. Start measuring.
Turn noisy, biased LLM-judge scores into precise, unbiased estimates of the outcomes you actually care about.
The Problem: Your Judge Is Lying

Real example: Claude Code sycophancy (absolutelyright.lol).
January: users loved the affirmation. Your LLM judge scored it 9/10. March: users found it cloying. The judge still gave 9/10. You shipped it. User satisfaction dropped 18%.
Raw judge scores are surrogates (S), not outcomes (Y). They fail in three ways:
1. Preference Inversion
Higher judge scores often predict lower real-world quality due to verbosity bias or sycophancy.
2. Invalid Confidence Intervals
Standard error bars assume the judge is perfect. A biased, uncalibrated judge can drive coverage toward 0%: your "95% CI" almost never captures the truth.
3. Scale Arbitrariness
Is 4.2 actually 5% better than 4.0? Or just noise? Without calibration, you can't know.
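A toy simulation (plain NumPy, not part of CJE) makes the coverage failure concrete: when the judge's scores carry even a modest upward bias, a confidence interval built from those scores alone almost never contains the true outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_ci_covers(bias=0.5, n=500):
    """Simulate a judge whose scores are biased upward by `bias`.

    The naive 95% CI is built from the judge scores alone, so it
    concentrates around the *biased* mean and misses the true one.
    """
    truth = 0.0                                   # true mean outcome
    scores = truth + bias + rng.normal(0, 1, n)   # biased judge scores
    mean, se = scores.mean(), scores.std(ddof=1) / np.sqrt(n)
    return mean - 1.96 * se <= truth <= mean + 1.96 * se

coverage = np.mean([naive_ci_covers() for _ in range(1000)])
print(f"naive 95% CI coverage with a biased judge: {coverage:.1%}")
```

With a bias of 0.5 on a unit-variance judge and n=500, the bias is roughly eleven standard errors wide, so the nominal "95%" interval essentially never covers the truth.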
The Solution: Calibration
CJE treats the judge as a programmable sensor that must be calibrated against ground truth.
- 1. Label a small slice: Provide "oracle" labels (human expert, gold-standard, A/B outcome) for 5-25% of your data.
- 2. Learn the mapping: CJE learns f(S, X) → Y using isotonic regression or two-stage calibration.
- 3. Estimate with rigor: Apply calibration at scale with valid CIs that propagate all uncertainty.
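As a rough sketch of the first two steps, here is how a monotone mapping from judge scores S to oracle labels Y can be learned with scikit-learn's `IsotonicRegression`. The data is synthetic and this is not CJE's internal implementation; it only illustrates the idea of fitting on a small labeled slice and applying the mapping everywhere.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical data: judge scores S on [0, 10] and oracle labels Y on [0, 1],
# with oracle labels available for only ~20% of 1,000 samples.
n, labeled_frac = 1000, 0.2
scores = rng.uniform(0, 10, n)
outcomes = 1 / (1 + np.exp(-(scores - 6)))            # true monotone relation
outcomes = np.clip(outcomes + rng.normal(0, 0.05, n), 0, 1)
labeled = rng.random(n) < labeled_frac

# Learn the monotone mapping f(S) -> Y on the labeled slice only...
calib = IsotonicRegression(out_of_bounds="clip")
calib.fit(scores[labeled], outcomes[labeled])

# ...then apply it to every judge score, labeled or not.
calibrated = calib.predict(scores)
print(f"mean calibrated outcome: {calibrated.mean():.3f}")
```

Isotonic regression is a natural fit here because it assumes only that higher judge scores mean better outcomes, not any particular scale for the scores.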
What should my oracle be?
- Expert audits (domain specialists rate quality 1-5)
- A/B test outcomes (conversion, retention lift)
- User satisfaction surveys (post-task ratings)
- Long-term metrics (7-day retention, support tickets)
Quick Start
```bash
pip install cje-eval
```

```python
from cje import analyze_dataset

# Point to responses directory (one JSONL per policy)
results = analyze_dataset(fresh_draws_dir="data/responses/")

# Get unbiased estimates with valid 95% CIs
for policy, est, se in zip(
    results.metadata["target_policies"],
    results.estimates,
    results.standard_errors,
):
    print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")

# Output:
# model_a: 0.732 ± 0.028
# model_b: 0.689 ± 0.031
```
(Figure: blue circles = CJE estimates; red diamonds = oracle ground truth; error bars = 95% CIs. With n=1,000 and 25% oracle labels, the intervals cover the true values.)
Three Modes of Evaluation
CJE selects the most rigorous estimator based on your data:
| Mode | Data Required | Use Case | Why |
|---|---|---|---|
| Direct | Fresh responses | A/B testing | Most robust. No complex math; just compare policies on the same prompts. |
| IPS | Logs + logprobs | Historical analysis | Fastest. Re-weight old logs without running new inference. |
| DR | Logs + fresh | Production deploys | Most accurate. Doubly robust: combines both for minimum variance. |
Recommendation: Start with Direct Mode. It's the most robust and requires no logprobs. Full guidance →
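In spirit, Direct mode reduces to a paired comparison of calibrated outcomes on shared prompts. A hedged sketch with synthetic numbers (not CJE's actual estimator) shows why this is so robust: per-prompt difficulty cancels in the pairing, leaving a tight interval on the policy difference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical calibrated outcomes for two policies on the SAME 500 prompts.
n = 500
quality = rng.uniform(0.4, 0.9, n)                       # per-prompt difficulty
y_a = np.clip(quality + rng.normal(0, 0.05, n), 0, 1)
y_b = np.clip(quality - 0.04 + rng.normal(0, 0.05, n), 0, 1)

# Direct mode, in essence: a paired difference on shared prompts.
diff = y_a - y_b
se = diff.std(ddof=1) / np.sqrt(n)
print(f"A - B: {diff.mean():.3f} ± {1.96 * se:.3f}")
```

Because both policies see the same prompts, prompt-level variance drops out of the difference, which is why no importance weighting or logprobs are needed.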
Validated on 5k Real Evaluations
Benchmarked 14 estimators on LMSYS Arena data with GPT-5 as oracle.
Sample Size Reference
| Tier | Oracle Labels | Eval Samples | CI Width |
|---|---|---|---|
| Minimal | 100-200 | 500-1k | ±5-8% |
| Recommended | 300-500 | 1-2k | ±3-4% |
| Gold | 1,000+ | 5k+ | ±1-2% |
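The CI widths in the table can be sanity-checked with a back-of-envelope bound: for an outcome bounded in [0, 1], the standard error of a mean is at most sqrt(0.25 / n). Treat these as optimistic floors; oracle-label scarcity and calibration uncertainty widen real intervals.

```python
import math

# Worst-case SE for a [0, 1] outcome is sqrt(0.25 / n);
# the 95% CI half-width is 1.96 times that.
for n in (500, 1000, 2000, 5000):
    half_width = 1.96 * math.sqrt(0.25 / n)
    print(f"n={n:>5}: ±{half_width:.1%}")
```

At n=500 this gives roughly ±4.4%, and at n=5,000 roughly ±1.4%, consistent with the tiers above once labeling overhead is accounted for.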
When to Use CJE
✅ Use CJE if:
- You're making high-stakes deployment decisions
- You suspect your judge favors verbose or sycophantic answers
- You have a small budget for gold labels (50-100 to start)
- You want valid confidence intervals, not just point estimates
❌ Don't use CJE if:
- You have zero gold labels (can't calibrate without ground truth)
- Your outcome can't be defined or measured
- You're doing exploratory research, not deployment decisions
Built-In Diagnostics
CJE tells you when to trust estimates and how to improve your setup:
Drift Monitoring
Track residuals (oracle label minus calibrated prediction) over time. When their mean drifts away from 0, calibration is stale: re-calibrate or update your judge.
Residual Analysis
Inspect large residuals to find what your judge is missing. Patterns reveal how to improve.
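Both diagnostics can be approximated outside CJE with a few lines of NumPy. The residuals here are simulated, and the |z| > 2 drift threshold is an illustrative choice, not a CJE default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residuals (oracle label minus calibrated prediction) in
# arrival order. A shift after sample 300 simulates calibration drift.
resid = rng.normal(0, 0.1, 500)
resid[300:] += 0.08

# Drift check: is the recent residual mean significantly nonzero?
recent = resid[-100:]
z = recent.mean() / (recent.std(ddof=1) / np.sqrt(len(recent)))
print(f"recent residual mean: {recent.mean():+.3f} (z = {z:.1f})")
if abs(z) > 2:
    print("calibration drift detected: re-calibrate or update the judge")

# Residual analysis: surface the largest misses for manual inspection.
worst = np.argsort(-np.abs(resid))[:5]
print("largest residuals at indices:", worst.tolist())
```

In practice you would look up the prompts and responses behind the worst residuals; recurring patterns (e.g. long answers, confident hedging) point to what the judge is missing.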
Next Steps
Ready to measure?
Start with the tutorial notebook or read the technical foundations.
