
Causal Judge Evaluation

Stop guessing. Start measuring.

Turn noisy, biased LLM-judge scores into precise, unbiased estimates of the outcomes you actually care about.

The Problem: Your Judge Is Lying

Real example: Claude Code sycophancy, repeatedly replying "You're absolutely right!" (absolutelyright.lol)

January: users loved the affirmation. Your LLM judge scored it 9/10. March: users found it cloying. The judge still gave 9/10. You shipped it. User satisfaction dropped 18%.

Raw judge scores are surrogates (S), not outcomes (Y). They fail in three ways:

1. Preference Inversion

Higher judge scores often predict lower real-world quality due to verbosity bias or sycophancy.

2. Invalid Confidence Intervals

Standard error bars assume the judge is perfect. Uncalibrated judges yield 0% coverage—your "95% CI" almost never captures the truth.
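A quick simulation makes the coverage failure concrete. Below, a judge score tracks the true outcome but carries a constant bias; the naive 95% CI around the mean judge score almost never contains the true rate. (Numbers are illustrative, not from the CJE benchmark.)

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate, n, trials = 0.6, 200, 2000
covered = 0
for _ in range(trials):
    y = rng.binomial(1, true_rate, n)           # real outcomes
    s = y * 0.8 + 0.25 + rng.normal(0, 0.1, n)  # biased, noisy judge score
    est = s.mean()
    se = s.std(ddof=1) / np.sqrt(n)
    # does the nominal 95% CI around the judge mean contain the truth?
    covered += (est - 1.96 * se <= true_rate <= est + 1.96 * se)
print(f"nominal 95% CI covered the truth in {covered / trials:.1%} of runs")
```

The interval is tight around the *biased* mean, so widening the sample only makes the miss more confident.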

3. Scale Arbitrariness

Is 4.2 actually 5% better than 4.0? Or just noise? Without calibration, you can't know.

The Solution: Calibration

CJE treats the judge as a programmable sensor that must be calibrated against ground truth.

  1. Label a small slice: Provide "oracle" labels (human expert, gold-standard, A/B outcome) for 5-25% of your data.
  2. Learn the mapping: CJE learns f(S, X) → Y using isotonic regression or two-stage calibration.
  3. Estimate with rigor: Apply calibration at scale with valid CIs that propagate all uncertainty.
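The core of step 2 can be sketched with plain isotonic regression: fit a monotone map from judge score to oracle label on the labeled slice, then apply it everywhere. This is a self-contained illustration (with a hand-rolled pool-adjacent-violators fit and simulated data), not the cje package's implementation.

```python
import numpy as np

def isotonic_fit(scores, labels):
    """Pool Adjacent Violators: nondecreasing map from score to label mean."""
    order = np.argsort(scores)
    y = labels[order].astype(float)
    vals, sizes = [], []
    for v in y:
        vals.append(v); sizes.append(1)
        # merge adjacent blocks while monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, n2 = vals.pop(), sizes.pop()
            v1, n1 = vals.pop(), sizes.pop()
            sizes.append(n1 + n2)
            vals.append((v1 * n1 + v2 * n2) / (n1 + n2))
    fitted = np.repeat(vals, sizes)
    return scores[order], fitted          # sorted scores, calibrated values

rng = np.random.default_rng(42)
n = 1000
s = rng.uniform(0, 10, n)                              # raw judge scores, 0-10
y = np.clip(1 / (1 + np.exp(-(s - 6))) + rng.normal(0, 0.05, n), 0, 1)

labeled = rng.random(n) < 0.25                         # ~25% oracle slice
xs, fs = isotonic_fit(s[labeled], y[labeled])
y_hat = np.interp(s, xs, fs)                           # apply map to all data
print(f"calibrated mean {y_hat.mean():.3f} vs oracle mean {y.mean():.3f}")
```

The monotone fit absorbs the judge's arbitrary scale: after calibration, scores live on the outcome scale, so a 0.02 difference means a 0.02 difference in Y.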

What should my oracle be?

  • Expert audits (domain specialists rate quality 1-5)
  • A/B test outcomes (conversion, retention lift)
  • User satisfaction surveys (post-task ratings)
  • Long-term metrics (7-day retention, support tickets)

Quick Start

Install:

pip install cje-eval
from cje import analyze_dataset

# Point to responses directory (one JSONL per policy)
results = analyze_dataset(fresh_draws_dir="data/responses/")

# Get unbiased estimates with valid 95% CIs
for policy, est, se in zip(
    results.metadata["target_policies"],
    results.estimates,
    results.standard_errors
):
    print(f"{policy}: {est:.3f} ± {1.96*se:.3f}")

# Output:
# model_a: 0.732 ± 0.028
# model_b: 0.689 ± 0.031
Forest plot showing policy estimates vs ground truth

What you're seeing: Blue circles = CJE estimates. Red diamonds = oracle ground truth. Error bars = 95% CIs. With n=1,000 and 25% oracle labels, intervals successfully cover the true values.

Three Modes of Evaluation

CJE selects the most rigorous estimator based on your data:

| Mode   | Data Required   | Use Case            | Why |
|--------|-----------------|---------------------|-----|
| Direct | Fresh responses | A/B testing         | Most robust. No complex math; just compare policies on the same prompts. |
| IPS    | Logs + logprobs | Historical analysis | Fastest. Re-weight old logs without running inference. |
| DR     | Logs + fresh    | Production deploy   | Most accurate. Doubly robust: combines both for minimum variance. |
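The DR row combines the other two: an outcome-model mean on fresh draws plus an importance-weighted correction from logged data. A minimal numeric sketch (simulated inputs; the variable names are assumptions, not the cje package's API):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
w = rng.lognormal(0.0, 0.5, n); w /= w.mean()   # normalized importance weights
y = rng.normal(0.7, 0.1, n)                     # calibrated rewards, logged data
g_logged = y + rng.normal(0.0, 0.05, n)         # outcome model at logged samples
g_fresh = rng.normal(0.7, 0.02, n)              # outcome model on fresh draws

# doubly robust estimate: model mean + weighted residual correction
dr = g_fresh.mean() + (w * (y - g_logged)).mean()
print(f"DR estimate: {dr:.3f}")
```

If either the outcome model or the weights are accurate, the estimate stays consistent, which is where the "doubly robust" name comes from.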

Recommendation: Start with Direct Mode. It's the most robust and requires no logprobs. Full guidance →

Validated on 5k Real Evaluations

We benchmarked 14 estimators on LMSYS Arena data with GPT-5 as the oracle.

  • 94% pairwise accuracy
  • 95% CI coverage
  • 16× cost reduction
Baseline comparison: Raw judges = 0% CI coverage. Standard IPS = worse than random chance.
See Full Results →

Sample Size Reference

| Tier        | Oracle Labels | Eval Samples | CI Width |
|-------------|---------------|--------------|----------|
| Minimal     | 100-200       | 500-1k       | ±5-8%    |
| Recommended | 300-500       | 1-2k         | ±3-4%    |
| Gold        | 1,000+        | 5k+          | ±1-2%    |
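As a sanity check on how width scales, the sampling-only half-width is 1.96 × sd / √n. The sd value below is an assumption for a calibrated reward in [0, 1], not the source of the table's numbers; calibration uncertainty from the oracle slice widens the real intervals beyond this floor.

```python
import math

sd = 0.45  # assumed std. dev. of calibrated rewards (illustrative)
for tier, n in [("Minimal", 750), ("Recommended", 1500), ("Gold", 5000)]:
    half = 1.96 * sd / math.sqrt(n)   # sampling-only CI half-width
    print(f"{tier:<12} n={n:<5} ±{half:.3f}")
```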

When to Use CJE

✅ Use CJE if:

  • You're making high-stakes deployment decisions
  • You suspect the judge favors verbose or sycophantic answers
  • You have a small budget for gold labels (50-100 to start)
  • You want valid confidence intervals, not just point estimates

❌ Don't use CJE if:

  • You have zero gold labels (can't calibrate without ground truth)
  • Your outcome can't be defined or measured
  • You're doing exploratory research, not making deployment decisions

Check detailed assumptions →

Built-In Diagnostics

CJE tells you when to trust estimates and how to improve your setup:

Drift Monitoring


Track residuals over time. When mean ≠ 0, calibration has drifted. Re-calibrate or update your judge.
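A drift check of this kind can be sketched as a one-sample t-test on the residuals (oracle label minus calibrated prediction). The residuals below are simulated with a small drift; in practice they come from your ongoing oracle slice.

```python
import numpy as np

rng = np.random.default_rng(2)
residuals = rng.normal(0.04, 0.10, 400)   # simulated: mean shifted off 0

mean = residuals.mean()
se = residuals.std(ddof=1) / np.sqrt(len(residuals))
t = mean / se                              # t-statistic against mean == 0
print(f"mean residual {mean:+.3f} (t = {t:.1f})",
      "-> re-calibrate" if abs(t) > 2 else "-> ok")
```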

Residual Analysis


Inspect large residuals to find what your judge is missing. Patterns reveal how to improve.
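The inspection step amounts to sorting by absolute residual and reviewing the worst cases by hand. A minimal sketch with simulated predictions (in practice, pair each residual with its prompt and response):

```python
import numpy as np

rng = np.random.default_rng(3)
pred = rng.uniform(0, 1, 100)                       # calibrated judge predictions
oracle = np.clip(pred + rng.normal(0, 0.1, 100), 0, 1)
resid = oracle - pred

worst = np.argsort(-np.abs(resid))[:3]              # three largest residuals
for i in worst:
    print(f"sample {i:3d}: judge {pred[i]:.2f}, "
          f"oracle {oracle[i]:.2f}, residual {resid[i]:+.2f}")
```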

Next Steps

Ready to measure?

Start with the tutorial notebook or read the technical foundations.