What Should the Next Eval Dollar Buy?
This is a practical guide to the most important budgeting question in judge-based evaluation: should the next dollar buy more cheap judge scores, or more expensive oracle labels? CJE is built to make that choice legible instead of leaving it to guesswork.
This is for teams evaluating model, prompt, or policy variants where oracle labels are expensive and judge scores are cheap. In this post, “oracle labels” means the smaller set of higher-trust labels you calibrate against, whether those come from humans or from a stronger model you trust more. If you are new to judge calibration, start with Your AI Metrics Are Lying to You. If you want the formal derivation, see Appendix F of the paper. If you want to run the workflow, open the planning notebook in Colab.
The budgeting question teams actually face
Suppose you are comparing two versions of a customer support bot on a large set of real user questions.
An LLM judge scores lots of answers because it is cheap. Humans, or a stronger review model you trust more, label a random subset because those labels are expensive.
At the end of the eval, one version looks better, but the size of the delta is still uncertain. You have budget for one more round of labeling. What should the next dollar buy?
This is the core label-budgeting problem. Not “how many samples in the abstract?” but: should you buy more judge scores across new examples, or more oracle labels on examples you have already scored?
Two kinds of uncertainty, two different purchases
CJE planning works because it separates two uncertainty sources that are often conflated in judge-based eval.
The first is coverage uncertainty: you have not seen enough examples to estimate average behavior precisely.
The second is ground-truth uncertainty: you do not yet know precisely how judge scores map to the outcome you actually care about.
Those are different problems, and they are reduced by different purchases. More judge-scored examples reduce the first. More higher-trust oracle labels reduce the second.
This is why adding only cheap judge scores often fails to tighten confidence enough: calibration uncertainty can still dominate even when you have lots of judged examples.
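A stylized two-component model makes this concrete. The constants below are illustrative, not CJE's fitted variance components: coverage uncertainty shrinks with the number of judge-scored examples, calibration uncertainty shrinks with the number of oracle labels, and the two add in quadrature.

```python
import math

def total_se(n_judge, n_oracle, coverage_var=1.0, calib_var=1.0):
    # Stylized model: coverage uncertainty shrinks with judged examples,
    # ground-truth (calibration) uncertainty shrinks with oracle labels.
    # The variance constants here are toy values, not pilot-fitted ones.
    return math.sqrt(coverage_var / n_judge + calib_var / n_oracle)

# Quadrupling judge scores barely helps when calibration dominates:
se_before = total_se(n_judge=10_000, n_oracle=50)
se_after = total_se(n_judge=40_000, n_oracle=50)
print(round(se_before, 4), round(se_after, 4))  # → 0.1418 0.1415
```

With 10,000 judged examples and only 50 oracle labels, the calibration term dominates, so 30,000 extra judge scores move the total standard error by well under one percent.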
Here is what that looks like in a simple illustrative planning scenario: a good judge, a fixed budget, cheap judge scores, and pricier oracle labels.
This uses the current planning workflow directly: `simulate_planning(r2=0.70, budget=5000, cost_model=CostModel(0.01, 0.16))`. The exact curve will change in your setting. The point of the chart is the fixed-budget tradeoff: early oracle labels can reduce total standard error quickly, but later oracle labels buy less and can eventually crowd out breadth across examples. The planner uses that same logic to choose an interior point rather than pushing all the way to either extreme.
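You can see why the optimum is interior with a small grid search over the budget split. This is a sketch under stated assumptions, not CJE's internal formula: it uses a common surrogate-style variance shape in which the judged sample covers a (1 − R²) share of the variance and the oracle subsample covers the remaining R² share, with the post's costs of $0.01 per judge score and $0.16 extra per oracle label.

```python
import math

def planned_se(budget, oracle_frac, r2=0.70, c_s=0.01, c_o=0.16):
    # Stylized standard error for a fixed budget split (illustrative only):
    # the judged sample handles a (1 - R^2) share of variance, the oracle
    # subsample handles the remaining R^2 share.
    m = budget * oracle_frac / c_o        # oracle labels (extra cost on top of judging)
    n = budget * (1 - oracle_frac) / c_s  # judge-scored examples (includes the m)
    if m > n:
        return float("inf")  # cannot oracle-label more examples than you judge
    return math.sqrt((1 - r2) / n + r2 / m)

# Grid-search the split for a $5,000 budget: the optimum is interior,
# not at either extreme.
fracs = [i / 100 for i in range(1, 100)]
best = min(fracs, key=lambda f: planned_se(5000, f))
print(best)  # → 0.86 under these toy assumptions
```

The exact optimum (and the resulting oracle share of samples) depends on the variance model, so this will not reproduce the table below; the qualitative shape is the point: all-judge and all-oracle splits are both dominated by a mixture.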
Practical warning
Many frameworks under-account for calibration uncertainty. That can produce narrow confidence intervals that look precise but are not decision-safe. In our benchmark paper, naive CIs on uncalibrated scores achieved 0% coverage.
How CJE turns that into a purchase decision
Once you separate the two uncertainty sources, the next purchase becomes legible. Two evals can have the same overall error bar and still imply different next moves depending on which component is dominating.
CJE combines pilot-estimated uncertainty with your cost model, then recommends an oracle/surrogate split that maximizes decision power for your budget.
What should the next dollar buy?
| If this dominates | What it means | Best next purchase |
|---|---|---|
| Coverage uncertainty | You have too little breadth across examples, so average performance is still noisy. | Buy more judge scores on additional examples. |
| Ground-truth uncertainty | You still do not know the judge-to-outcome mapping precisely enough. | Buy more oracle labels on scored examples. |
| Neither dominates | Marginal gains from both purchases are similar, so over-buying either one wastes budget. | Split budget across both label types. |
Concrete example
Suppose you already have judge scores on 10,000 support tickets, but only 50 of those tickets have higher-trust oracle labels. Buying 5,000 more judge scores probably does not fix the main bottleneck. You already have breadth. What you lack is enough oracle coverage to learn how judge scores map to the outcome you actually care about, so the next dollar should buy more oracle labels instead.
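The decision rule behind the table above can be sketched as a marginal-value-per-dollar comparison. This uses the same stylized two-component variance model as before (illustrative constants, not CJE's internal computation), and the 1.5× threshold is an arbitrary choice for the sketch:

```python
def best_next_purchase(n, m, coverage_var, calib_var, c_judge, c_oracle):
    # Marginal reduction in SE^2 per dollar for each purchase type,
    # under a stylized model where coverage variance scales as 1/n
    # (judge-scored examples) and calibration variance as 1/m (oracle labels).
    gain_judge = (coverage_var / n - coverage_var / (n + 1)) / c_judge
    gain_oracle = (calib_var / m - calib_var / (m + 1)) / c_oracle
    if gain_judge > 1.5 * gain_oracle:  # threshold is arbitrary, for illustration
        return "judge scores"
    if gain_oracle > 1.5 * gain_judge:
        return "oracle labels"
    return "split"

# The 10,000-tickets / 50-oracle-labels example from above:
print(best_next_purchase(10_000, 50, 1.0, 1.0, 0.01, 0.16))  # → oracle labels
```

Even though oracle labels cost 16× more here, the 51st oracle label removes far more variance per dollar than the 10,001st judge score.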
Worked scenario
Here is the same illustrative planning scenario across different budgets. This is closer to a cheaper model-oracle workflow than a fully human-review workflow. The point is not the exact numbers. The point is that the planner keeps the split stable and scales both label types together.
| Budget | Total Samples | Oracle Labels | Oracle % | MDE |
|---|---|---|---|---|
| $100 | 2,358 | 477 | 20.2% | 7.2% |
| $500 | 11,790 | 2,388 | 20.3% | 3.2% |
| $1,000 | 23,581 | 4,776 | 20.3% | 2.3% |
| $5,000 | 117,908 | 23,880 | 20.3% | 1.0% |
| $10,000 | 235,816 | 47,761 | 20.3% | 0.7% |
Simulated good-judge scenario: R² = 0.70, surrogate cost = $0.01, oracle cost = $0.16, 80% power, α=0.05. MDE is for pairwise policy comparison. Use the notebook to get your own numbers from simulated or pilot-fitted variance components.
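The MDE column follows from the standard power calculation: for a two-sided test, the minimum detectable effect is the standard error of the pairwise difference scaled by the sum of the two z-scores. This is the textbook formula, not CJE-specific code, and the example SE is a made-up input:

```python
def mde(se_diff, z_alpha=1.9600, z_power=0.8416):
    # Minimum detectable effect at 80% power, alpha = 0.05 (two-sided):
    # MDE = (z_{1-alpha/2} + z_{power}) * SE of the pairwise difference.
    # z defaults correspond to alpha = 0.05 and power = 0.80.
    return (z_alpha + z_power) * se_diff

# e.g. a pairwise SE of 0.36 percentage points gives an MDE of about 1pp:
print(round(mde(0.0036), 3))  # → 0.01
```

This is why MDE in the table shrinks roughly with the square root of budget: doubling the budget scales both label counts together, cutting SE (and hence MDE) by about √2.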
Practical workflow
The planning workflow is simple: start with simulation if you have no pilot data, then refit with pilot labels once you have them.
CJE planning workflow
- Start with rough assumptions: set costs and run `simulate_planning` or `simulate_variance_model`.
- Refine with pilot data: fit variance from real labels with `fit_variance_model`.
- Choose your planning mode: use `plan_evaluation` for a fixed budget or `plan_for_mde` for a target effect size.
Bottom line
Label budgeting is not guesswork. The question is not just how many labels to buy, but which labels to buy next. More judge scores reduce coverage uncertainty. More oracle labels reduce calibration uncertainty. CJE helps you choose the purchase that buys the most confidence per dollar.
Start planning your evaluation
Install CJE and run the planning tools on your own data. The most common mistake is still treating oracle labels as optional. The notebook makes the tradeoff explicit before you spend the budget.
We welcome your feedback
CJE is actively evolving. If you have suggestions, find errors, or want to share your experience using the planning tools on real data, we'd love to hear from you.
