CIMO Labs

What Should the Next Eval Dollar Buy?

Eddie Landesberg · 6 min read

This is a practical guide to the most important budgeting question in judge-based evaluation: should the next dollar buy more cheap judge scores, or more expensive oracle labels? CJE is built to make that choice legible instead of guesswork.

This is for teams evaluating model, prompt, or policy variants where oracle labels are expensive and judge scores are cheap. In this post, “oracle labels” means the smaller set of higher-trust labels you calibrate against, whether those come from humans or from a stronger model you trust more. If you are new to judge calibration, start with Your AI Metrics Are Lying to You. If you want the formal derivation, see Appendix F of the paper. If you want to run the workflow, open the planning notebook in Colab.

The budgeting question teams actually face

Suppose you are comparing two versions of a customer support bot on a large set of real user questions.

An LLM judge scores lots of answers because it is cheap. Humans, or a stronger review model you trust more, label a random subset because those labels are expensive.

At the end of the eval, one version looks better, but the size of the delta is still uncertain. You have budget for one more round of labeling. What should the next dollar buy?

This is the core label-budgeting problem. Not “how many samples in the abstract?” but: should you buy more judge scores across new examples, or more oracle labels on examples you have already scored?

Two kinds of uncertainty, two different purchases

CJE planning works because it separates two uncertainty sources that are often conflated in judge-based eval.

Figure: two uncertainty sources. Evaluation uncertainty comes from the total pool of judge-scored samples; calibration uncertainty comes from the oracle-labeled subset of that same pool. CJE balances the two.

The first is coverage uncertainty: you have not seen enough examples to estimate average behavior precisely.

The second is ground-truth uncertainty: you do not yet know precisely how judge scores map to the outcome you actually care about.

Those are different problems, and they are reduced by different purchases. More judge-scored examples reduce the first. More higher-trust oracle labels reduce the second.

This is why adding only cheap judge scores often fails to tighten confidence enough: calibration uncertainty can still dominate even when you have lots of judged examples.
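To make the two purchases concrete, here is a toy variance model. This is an illustrative sketch, not CJE's internals, and the sigma values are made up: coverage variance shrinks with the total number of judge-scored samples, while calibration variance shrinks only with the number of oracle labels.

```python
import math

def total_se(n_judged, n_oracle, sigma_eval=0.5, sigma_calib=0.4):
    """Toy model: coverage variance falls with total judged samples,
    calibration variance falls only with oracle-labeled samples.
    The sigma values are hypothetical, not fitted to anything."""
    coverage_var = sigma_eval**2 / n_judged
    calibration_var = sigma_calib**2 / n_oracle
    return math.sqrt(coverage_var + calibration_var)

# 10,000 judged tickets, only 50 oracle labels:
base = total_se(10_000, 50)
more_judge = total_se(15_000, 50)    # buy 5,000 more judge scores
more_oracle = total_se(10_000, 300)  # buy 250 more oracle labels
```

At roughly equal marginal spend under the costs used later in this post (5,000 judge scores at $0.01 is $50; 250 oracle labels at $0.16 is $40), the oracle purchase cuts the error bar far more, because the calibration term dominates.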

Here is what that looks like in a simple illustrative planning scenario: a good judge, a fixed budget, cheap judge scores, and pricier oracle labels.

Figure: simulation-backed CJE planning example showing total standard error versus oracle coverage under a fixed $5,000 budget. Standard error drops quickly from low oracle coverage to about 20 to 30 percent, then flattens and slightly rises at 50 percent.

This uses the current planning workflow directly: simulate_planning(r2=0.70, budget=5000, cost_model=CostModel(0.01, 0.16)). The exact curve will change in your setting. The point of this chart is the fixed-budget tradeoff: early oracle labels can reduce total standard error quickly, but later oracle labels buy less and can eventually crowd out breadth across examples. The planner uses that same logic to choose an interior point rather than pushing all the way to either extreme.
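The shape of that curve can be reproduced with a small self-contained sketch. The per-sample costs ($0.01 and $0.16) come from the post's cost model; the variance parameters are hypothetical, and this is not CJE's actual planner, just the fixed-budget logic it formalizes: every sample gets a judge score, a fraction p also gets an oracle label, and the budget pins down how many samples you can afford at each p.

```python
import math

def se_at_coverage(p, budget=5000.0, cost_judge=0.01, cost_oracle=0.16,
                   sigma_eval=0.5, sigma_calib=0.4):
    """Oracle coverage p: every sample costs a judge score, and a
    fraction p of samples additionally costs an oracle label, so the
    budget fixes the total sample count. Sigmas are hypothetical."""
    n_total = budget / (cost_judge + p * cost_oracle)
    n_oracle = p * n_total
    return math.sqrt(sigma_eval**2 / n_total + sigma_calib**2 / n_oracle)

# SE falls steeply at low coverage, bottoms out, then creeps back up
# as oracle labels crowd out breadth across examples:
for p in (0.05, 0.10, 0.20, 0.30, 0.50):
    print(f"coverage {p:.0%}: SE = {se_at_coverage(p):.4f}")
```

With these particular parameters the minimum lands near 20 to 30 percent coverage, echoing the chart, but the location of the optimum depends entirely on your own variances and costs.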

Practical warning

Many frameworks under-account for calibration uncertainty. That can produce narrow confidence intervals that look precise but are not decision-safe. In our benchmark paper, naive CIs on uncalibrated scores achieved 0% coverage.

How CJE turns that into a purchase decision

Once you separate the two uncertainty sources, the next purchase becomes legible. Two evals can have the same overall error bar and still imply different next moves depending on which component is dominating.

CJE combines pilot-estimated uncertainty with your cost model, then recommends an oracle/surrogate split that maximizes decision power for your budget.

What should the next dollar buy?


| If this dominates | What it means | Best next purchase |
| --- | --- | --- |
| Coverage uncertainty | You have too little breadth across examples, so average performance is still noisy. | Buy more judge scores on additional examples. |
| Ground-truth uncertainty | You still do not know the judge-to-outcome mapping precisely enough. | Buy more oracle labels on scored examples. |
| Neither dominates | Marginal gains from both purchases are similar, so over-buying either one wastes budget. | Split budget across both label types. |

Concrete example

Suppose you already have judge scores on 10,000 support tickets, but only 50 of those tickets have higher-trust oracle labels. Buying 5,000 more judge scores probably does not fix the main bottleneck. You already have breadth. What you lack is enough oracle coverage to learn how judge scores map to the outcome you actually care about, so the next dollar should buy more oracle labels instead.
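That reasoning is just a marginal-value comparison, and it can be sketched directly. In this hypothetical model (the sigma values and the helper names are illustrative, not CJE's API), you spend the same marginal dollar both ways and see which purchase shrinks the standard error more:

```python
import math

def se(n_judged, n_oracle, sigma_eval=0.5, sigma_calib=0.4):
    """Toy standard error: a coverage term plus a calibration term."""
    return math.sqrt(sigma_eval**2 / n_judged + sigma_calib**2 / n_oracle)

def best_next_purchase(n_judged, n_oracle, dollars=100.0,
                       cost_judge=0.01, cost_oracle=0.16):
    """Spend the same marginal budget each way; keep the bigger SE drop."""
    gain_judge = se(n_judged, n_oracle) - se(n_judged + dollars / cost_judge, n_oracle)
    gain_oracle = se(n_judged, n_oracle) - se(n_judged, n_oracle + dollars / cost_oracle)
    return "judge scores" if gain_judge > gain_oracle else "oracle labels"

# The 10,000-tickets / 50-labels case: calibration dominates.
print(best_next_purchase(10_000, 50))
# A breadth-starved case (500 judged, 400 labeled): coverage dominates.
print(best_next_purchase(500, 400))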

Worked scenario

Here is the same illustrative planning scenario across different budgets. This is closer to a cheaper model-oracle workflow than a fully human-review workflow. The point is not the exact numbers. The point is that the planner keeps the split stable and scales both label types together.


| Budget | Total Samples | Oracle Labels | Oracle % | MDE |
| --- | --- | --- | --- | --- |
| $100 | 2,358 | 477 | 20.2% | 7.2% |
| $500 | 11,790 | 2,388 | 20.3% | 3.2% |
| $1,000 | 23,581 | 4,776 | 20.3% | 2.3% |
| $5,000 | 117,908 | 23,880 | 20.3% | 1.0% |
| $10,000 | 235,816 | 47,761 | 20.3% | 0.7% |

Simulated good-judge scenario: R² = 0.70, surrogate cost = $0.01, oracle cost = $0.16, 80% power, α=0.05. MDE is for pairwise policy comparison. Use the notebook to get your own numbers from simulated or pilot-fitted variance components.
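The MDE column follows from standard power arithmetic. Here is a hedged sketch of that conversion, assuming a two-sided test with two independent arms; CJE's paired-comparison math may use a different constant, so treat this as the textbook approximation rather than the planner's exact formula:

```python
import math
from statistics import NormalDist

def mde(se_per_policy, alpha=0.05, power=0.80):
    """Minimum detectable effect for comparing two policies.
    Assumes independent arms, so the SE of the difference is
    sqrt(2) times the per-policy SE; paired designs do better."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * math.sqrt(2) * se_per_policy

# e.g. a per-policy SE of 0.36% implies an MDE of roughly 1.4%
print(f"{mde(0.0036):.4f}")
```

Halving the standard error halves the MDE, which is why the table's MDE shrinks roughly with the square root of budget.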

Practical workflow

The planning workflow is simple: start with simulation if you have no pilot data, then refit with pilot labels once you have them.

CJE planning workflow

  1. Start with rough assumptions: set costs and run simulate_planning or simulate_variance_model.
  2. Refine with pilot data: fit variance from real labels with fit_variance_model.
  3. Choose your planning mode: use plan_evaluation for a fixed budget or plan_for_mde for a target effect size.

Bottom line

Label budgeting is not guesswork. The question is not just how many labels to buy, but which labels to buy next. More judge scores reduce coverage uncertainty. More oracle labels reduce calibration uncertainty. CJE helps you choose the purchase that buys the most confidence per dollar.

Start planning your evaluation

Install CJE and run the planning tools on your own data. The most common mistake is still treating oracle labels as optional. The notebook makes the tradeoff explicit before you spend the budget.

$ pip install cje-eval

We welcome your feedback

CJE is actively evolving. If you have suggestions, find errors, or want to share your experience using the planning tools on real data, we'd love to hear from you.

eddie@cimolabs.com
