Causal evaluation for LLM systems
The problem: You have abundant cheap metrics (LLM-as-judge, clicks, thumbs) that don't map to what you care about and are easy to game. You have expensive labels (expert audits, A/B outcomes, retention) that are robust but scarce.
The solution: Causal Judge Evaluation (CJE) leverages both—calibrate cheap metrics to expensive labels on a small sample, apply that mapping at scale, and detect when it breaks.
Read: Why Your Metrics Are Lying →
What CJE Returns

Example from our LMSYS Chatbot Arena benchmark: CJE estimates (blue circles with error bars) vs. oracle ground truth (red diamonds). With n=1,000 evaluations and 25% oracle labels, the 95% confidence intervals cover the true values.
Note: In applied settings, you won't have oracle ground truth—that's why you need statistical inference. The red diamonds are shown here only for validation.
What CJE Does
Calibrates cheap metrics to expensive outcomes
Learn how judge scores map to real outcomes (conversion, retention, expert audits). Use that mapping at scale.
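To make the calibrate-then-apply idea concrete, here is a minimal sketch using scikit-learn's isotonic regression. The data, sample sizes, and variable names are invented for illustration; this is not the cje-eval API.

```python
# Minimal sketch of calibrate-then-apply (illustrative; not the cje-eval API).
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical data: cheap judge scores for 1,000 responses, expensive oracle
# labels (e.g. expert audits) for a 25% slice of them.
judge_scores = rng.uniform(0, 1, size=1_000)
labeled_idx = rng.choice(1_000, size=250, replace=False)
oracle_labels = (judge_scores[labeled_idx] + rng.normal(0, 0.2, size=250) > 0.5).astype(float)

# 1. Calibrate: learn a monotone map from judge score to expected oracle outcome.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(judge_scores[labeled_idx], oracle_labels)

# 2. Apply at scale: push every cheap score through the learned map.
calibrated_outcomes = calibrator.predict(judge_scores)
print(f"Estimated mean outcome: {calibrated_outcomes.mean():.3f}")
```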
Reports honest confidence intervals
Accounts for both evaluation uncertainty and calibration uncertainty. No false wins on noisy data.
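One way to see what "both sources of uncertainty" means is a bootstrap that refits the calibration map on resampled oracle labels and also resamples the evaluation set. This is only a conceptual sketch with hypothetical inputs; it is not how the cje-eval package computes its intervals, which may use analytic variance estimates instead.

```python
# Conceptual sketch of "honest" intervals: a bootstrap that resamples both the
# small oracle-labeled calibration slice (calibration uncertainty) and the full
# evaluation set (sampling uncertainty). Not the cje-eval implementation.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def bootstrap_ci(judge_scores, cal_scores, cal_labels, n_boot=2000, alpha=0.05, seed=0):
    """Interval for the mean calibrated outcome over the evaluation set."""
    rng = np.random.default_rng(seed)
    n, m = len(judge_scores), len(cal_scores)
    estimates = []
    for _ in range(n_boot):
        cal = rng.integers(0, m, size=m)   # refit calibration -> calibration uncertainty
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(cal_scores[cal], cal_labels[cal])
        ev = rng.integers(0, n, size=n)    # resample evaluation set -> sampling uncertainty
        estimates.append(iso.predict(judge_scores[ev]).mean())
    return tuple(np.quantile(estimates, [alpha / 2, 1 - alpha / 2]))
```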
Detects when calibration breaks
Built-in drift monitoring and transportability tests. Know when your judge→outcome mapping stops working.
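A rough illustration of the transportability idea: collect a fresh, small batch of oracle labels and check whether the previously learned judge→outcome map is still unbiased on it. This is a simplified stand-in, not the library's built-in diagnostics; the function name and threshold are invented.

```python
# Simplified illustration of a drift check on a fresh oracle-labeled audit batch.
# cje-eval ships its own diagnostics; names and thresholds here are invented.
import numpy as np

def calibration_drifted(calibrator, new_judge_scores, new_oracle_labels, tol=0.05):
    """Flag drift if calibrated predictions are biased on the fresh audit batch."""
    residuals = np.asarray(new_oracle_labels) - calibrator.predict(np.asarray(new_judge_scores))
    bias = residuals.mean()
    se = residuals.std(ddof=1) / np.sqrt(len(residuals))
    # Flag only when the bias is both practically meaningful (> tol) and
    # statistically distinguishable from zero (roughly a 2-sigma test).
    return abs(bias) > tol and abs(bias) > 2 * se
```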
pip install cje-eval
Validated on 5,000 Real Chatbot Arena Evaluations
Benchmarked 14 estimators on real LMSYS Arena data with GPT-5 as oracle. 94% pairwise ranking accuracy vs. 38% for raw importance sampling.
See Full Results →
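For reference, pairwise ranking accuracy here means the fraction of policy pairs whose estimated ordering matches the oracle ordering. The helper below is an illustrative computation, not the benchmark's actual evaluation code.

```python
# Illustration of pairwise ranking accuracy: the share of policy pairs whose
# estimated ordering agrees with the oracle ordering. Not the benchmark code.
from itertools import combinations

def pairwise_ranking_accuracy(estimates, oracle_values):
    pairs = list(combinations(range(len(estimates)), 2))
    agreements = sum(
        (estimates[i] - estimates[j]) * (oracle_values[i] - oracle_values[j]) > 0
        for i, j in pairs
    )
    return agreements / len(pairs)

# e.g. pairwise_ranking_accuracy([0.61, 0.55, 0.70], [0.60, 0.52, 0.74]) -> 1.0
```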