
Learning Path

Master CJE at your own pace. Start with the basics and go as deep as you need.

How to use this guide: Follow the path sequentially. Each step builds on the previous one. Stop when you have what you need—whether that's a working implementation or a deep understanding of the theory.

The Core Concept

The Deliberation Ladder

Y*: Idealized Deliberation Oracle (IDO)

What you'd decide with unlimited time, perfect information, and reflection.

Examples: Customer lifetime value, long-term health/happiness, future life satisfaction

Y: Oracle / High-Rung Outcome

Expensive but practical labels that better approximate Y*.

Examples: Expert audits, task success, long(er)-run outcomes, 30–90d retention, expert panel rating, task completion rate

S: Cheap Surrogate

Fast signals you can collect at scale.

Examples: LLM-judge scores, clicks, watch-time, quick human labels, BLEU

Everything below teaches you how to climb this ladder efficiently—calibrating cheap signals (S) to expensive outcomes (Y) so you can make decisions that serve your true objective (Y*).
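
A minimal sketch of what the ladder looks like as data (field names are hypothetical, not the CJE schema): every response gets a cheap surrogate score S, while only a small audited slice also gets an expensive oracle label Y. Calibration learns the mapping from S to Y on that slice and applies it everywhere else.

# Toy log of evaluated responses (illustrative field names, not the CJE schema).
rows = [
    {"prompt_id": "p1", "judge_score_S": 0.91, "oracle_label_Y": 1.0},   # audited
    {"prompt_id": "p2", "judge_score_S": 0.84, "oracle_label_Y": None},  # surrogate only
    {"prompt_id": "p3", "judge_score_S": 0.37, "oracle_label_Y": 0.0},   # audited
    {"prompt_id": "p4", "judge_score_S": 0.72, "oracle_label_Y": None},  # surrogate only
]

# Climbing the ladder means fitting a map S -> E[Y | S] on the audited rows,
# then using it to score the rows where only S was collected.
audited = [r for r in rows if r["oracle_label_Y"] is not None]
print(f"{len(audited)}/{len(rows)} rows carry an oracle label Y")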

Short on time?

CJE in 5 Minutes

Quick overview: the problem, the solution, one concrete example, and where to go next. Read this if you want the gist before committing to the full path.

Read the 5-minute overview

1. Start: Why Your Metrics Are Lying

Zero math. Just concrete failure stories—like the "You're absolutely right!" meme that tanked developer productivity. Understand the deliberation ladder (S → Y → Y*) and why high scores on cheap metrics can predict low value on what actually matters.

Read: Your AI Metrics Are Lying to You

⏱️ 30 minutes • Zero equations • Accessible to executives and PMs

2. See CJE in Action

CJE does three things: calibrates judge scores to real outcomes, gives you honest confidence intervals, and detects when calibration drifts. See all three in one page—no installation required.

Try: CJE in 5 Minutes (Colab)

⏱️ 5 minutes • Runs in browser via Google Colab
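
To make the third item concrete, here is a toy drift check (illustrative only, not the CJE API or its diagnostics): if the calibration map fitted on an older audited slice starts mispredicting Y on a fresh audited slice, the S-to-Y relationship has drifted and the map should be refit.

# (judge score S, oracle label Y) pairs from two audit rounds.
old_slice = [(0.9, 1.0), (0.4, 0.0), (0.7, 1.0)]
new_slice = [(0.9, 0.0), (0.8, 0.0), (0.3, 0.0)]   # judge scores no longer track Y

calibrate = lambda s: 1.0 if s > 0.6 else 0.0      # stand-in for a fitted S -> Y map

def mean_abs_error(pairs, f):
    return sum(abs(f(s) - y) for s, y in pairs) / len(pairs)

drift = mean_abs_error(new_slice, calibrate) - mean_abs_error(old_slice, calibrate)
print(f"calibration error rose by {drift:.2f} on the fresh audit slice")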

3. Align Your Setup

Now that you've seen what CJE does, align your prompts and judges to the same target (Y*). Define a Standard Deliberation Protocol (SDP), get copy-paste templates, and avoid common pitfalls. Zero math.

Read: Aligning Generation & Evaluation

⏱️ 18 minutes • Get: SDP templates, judge rubrics, rollout plan
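
As a purely hypothetical illustration of what an SDP pins down (these field names and values are mine, drawn from the ladder examples above; use the templates in the linked guide), the point is that generation prompts and judge rubrics reference the same target:

# Hypothetical SDP summary; not one of the linked templates.
sdp = {
    "target_Y_star": "customer is measurably better off over the long run",
    "oracle_Y": "30-90d retention plus a sampled expert audit",
    "surrogate_S": "LLM-judge score against the shared rubric",
    "judge_rubric": "score whether the response serves the long-run outcome, "
                    "not whether it sounds agreeable",
}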

4. Get CJE Running

Install the package, run your first evaluation, interpret diagnostics. Most practitioners stop here—you'll have a working implementation and know when to trust it.

# Install
pip install cje-eval

# Run
from cje import analyze_dataset
result = analyze_dataset(fresh_draws_dir="responses/")

⏱️ 30 minutes hands-on

✓ Checkpoint: You can stop here

At this point, you understand the problem, have aligned your prompts and judges to Y*, and can run CJE in production. You know what CJE does: calibrate cheap metrics, give honest confidence intervals, and detect when calibration breaks.

Continue if you need to explain the methods to skeptical stakeholders or understand the theoretical foundations.

arXiv:2512.11150

Prefer a single comprehensive paper? The preprint covers methodology, experiments, and formal theory in one document.

Read the Paper →

5. Alignment Theory: Formal Proofs

Formal framework for Y*-alignment: propositions (optimization compatibility, judge-pool invariance, OPE transportability), assumptions ledger, and integration with OpenAI's deliberative alignment work.

Read: Your AI Is Optimizing the Wrong Thing — Technical Appendix

⏱️ 25 minutes • Covers: calibration theory, judge-pool invariance theorem, OPE transportability

6. Economics of Alignment

Why alignment fails when fabrication is cheap and verification is expensive. How SDP raises fabrication cost via checkable commitments and lowers verification cost via structured decomposition. The Beckerian deterrence model for AI accountability.

⏱️ 30 minutes combined • Covers: F vs V economics, certificate ladder, enforcement pathways
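
A toy formalization of that deterrence logic (my own sketch, not the model in the linked essay): an agent fabricates when the expected gain net of fabrication cost F exceeds the expected penalty, and cheaper verification V lets you audit more often, which raises the detection probability.

# Illustrative numbers only; the inequality is the Beckerian core.
def fabrication_pays(gain, fabrication_cost_F, detection_prob, penalty):
    return gain - fabrication_cost_F > detection_prob * penalty

audit_budget, verification_cost_V = 100.0, 25.0
detection_prob = min(1.0, audit_budget / verification_cost_V / 10)  # cheaper audits -> more audits
print(fabrication_pays(gain=5.0, fabrication_cost_F=1.0,
                       detection_prob=detection_prob, penalty=20.0))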

7. Empirical Deep Dive

Complete evaluation on 5k real Arena samples: all 14 estimators, ablations, diagnostics, uncertainty decomposition. Understand edge cases and when methods fail.

Read: Full Arena Experiment

⏱️ 45 minutes • Focus: estimator comparisons, OUA decomposition, transportability tests

8. Why the Methods Work

Understand the unifying principle: projection onto convex constraint sets. Why AutoCal-R and SIMCal-W reduce variance while preserving unbiasedness. When off-policy evaluation hits fundamental limits.

⏱️ 25 minutes combined • Covers: isotonic regression, mean preservation, ESS limits
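
A sketch of the core projection idea using scikit-learn's isotonic regression (an illustration of the principle, not CJE's AutoCal-R implementation): fit a monotone map from judge scores S to oracle labels Y on the audited slice, and check that the fitted values preserve the mean of Y on that slice.

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
S = rng.uniform(0, 1, 200)                           # judge scores
Y = (rng.uniform(0, 1, 200) < 0.2 + 0.6 * S) * 1.0   # noisy oracle labels

iso = IsotonicRegression(out_of_bounds="clip").fit(S, Y)
calibrated = iso.predict(S)
print(Y.mean(), calibrated.mean())   # equal on the calibration data: mean preservation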

9. Evaluation Theory: Formal Framework

Complete theoretical treatment: identification assumptions (S1-S4, L1-L2), influence functions, semiparametric efficiency, Pearl-Bareinboim transport theory.

Read: Technical Appendix

⏱️ 25 minutes • Covers: DM/IPS/DR estimators, cross-fitting, DML, transportability proofs, literature connections
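
For orientation before the formal treatment, here are the three estimator families in their textbook off-policy evaluation form (a generic sketch with simulated numbers, not CJE's exact estimators): the direct method (DM) averages an outcome model under the target policy, IPS reweights logged rewards by importance weights, and the doubly robust (DR) estimator combines them so it remains consistent if either the outcome model or the weights are correct.

import numpy as np

rng = np.random.default_rng(1)
n = 1000
r = rng.binomial(1, 0.6, n).astype(float)          # calibrated rewards on logged samples
w = np.exp(rng.normal(0.0, 0.5, n)); w /= w.mean() # importance weights, normalized to mean one
m_logged = np.full(n, 0.55)                        # outcome model at the logged actions
m_target = np.full(n, 0.58)                        # outcome model averaged under the target policy

dm = m_target.mean()                               # direct method
ips = (w * r).mean()                               # importance sampling
dr = m_target.mean() + (w * (r - m_logged)).mean() # doubly robust correction
print(f"DM={dm:.3f}  IPS={ips:.3f}  DR={dr:.3f}")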

Ready to Implement?