What Should the Next Eval Dollar Buy?
This is a practical guide to the most important budgeting question in judge-based evaluation: should the next dollar buy more cheap judge scores, or more expensive oracle labels? CJE is built to make that choice legible instead of leaving it to guesswork.
This is for teams evaluating model, prompt, or policy variants where oracle labels are expensive and judge scores are cheap. In this post, “oracle labels” means the smaller set of higher-trust labels you calibrate against, whether those come from humans or from a stronger model you trust more. If you are new to judge calibration, start with Your AI Metrics Are Lying to You. If you want the formal derivation, see Appendix F of the paper. If you want to run the workflow, open the planning notebook in Colab.
The budgeting question teams actually face
Suppose you are comparing two versions of a customer support bot on a large set of real user questions.
An LLM judge scores lots of answers because it is cheap. Humans, or a stronger review model you trust more, label a random subset because those labels are expensive.
At the end of the eval, one version looks better, but the size of the delta is still uncertain. You have budget for one more round of labeling. What should the next dollar buy?
This is the core label-budgeting problem. Not “how many samples in the abstract?” but: should you buy more judge scores across new examples, or more oracle labels on examples you have already scored?
Two kinds of uncertainty, two different purchases
CJE planning works because it separates two uncertainty sources that are often conflated in judge-based eval.
The first is coverage uncertainty: you have not seen enough examples to estimate average behavior precisely.
The second is ground-truth uncertainty: you do not yet know precisely how judge scores map to the outcome you actually care about.
Those are different problems, and they are reduced by different purchases. More judge-scored examples reduce the first. More higher-trust oracle labels reduce the second.
This is why adding only cheap judge scores often fails to tighten confidence enough: calibration uncertainty can still dominate even when you have lots of judged examples.
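A stylized two-component model makes this concrete. The constants below are illustrative, not CJE's fitted variance components: coverage uncertainty shrinks with the number of judge-scored examples, calibration uncertainty shrinks with the number of oracle labels, and the two add in quadrature.

```python
import math

def total_se(n_judge, n_oracle, coverage_var=1.0, calib_var=1.0):
    # Stylized model: coverage uncertainty shrinks with judged examples,
    # ground-truth (calibration) uncertainty shrinks with oracle labels.
    # The variance constants here are toy values, not pilot-fitted ones.
    return math.sqrt(coverage_var / n_judge + calib_var / n_oracle)

# Quadrupling judge scores barely helps when calibration dominates:
se_before = total_se(n_judge=10_000, n_oracle=50)
se_after = total_se(n_judge=40_000, n_oracle=50)
print(round(se_before, 4), round(se_after, 4))  # → 0.1418 0.1415
```

With 10,000 judged examples and only 50 oracle labels, the calibration term dominates, so 30,000 extra judge scores move the total standard error by well under one percent.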
Here is what that looks like in a simple illustrative planning scenario: a good judge, a fixed budget, cheap judge scores, and pricier oracle labels.
This uses the current planning workflow directly: `simulate_planning(r2=0.70, budget=5000, cost_model=CostModel(0.01, 0.16))`. The exact curve will change in your setting. The point of the chart is the fixed-budget tradeoff: early oracle labels can reduce total standard error quickly, but later oracle labels buy less and can eventually crowd out breadth across examples. The planner uses that same logic to choose an interior point rather than pushing all the way to either extreme.
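You can see why the optimum is interior with a small grid search over the budget split. This is a sketch under stated assumptions, not CJE's internal formula: it uses a common surrogate-style variance shape in which the judged sample covers a (1 − R²) share of the variance and the oracle subsample covers the remaining R² share, with the post's costs of $0.01 per judge score and $0.16 extra per oracle label.

```python
import math

def planned_se(budget, oracle_frac, r2=0.70, c_s=0.01, c_o=0.16):
    # Stylized standard error for a fixed budget split (illustrative only):
    # the judged sample handles a (1 - R^2) share of variance, the oracle
    # subsample handles the remaining R^2 share.
    m = budget * oracle_frac / c_o        # oracle labels (extra cost on top of judging)
    n = budget * (1 - oracle_frac) / c_s  # judge-scored examples (includes the m)
    if m > n:
        return float("inf")  # cannot oracle-label more examples than you judge
    return math.sqrt((1 - r2) / n + r2 / m)

# Grid-search the split for a $5,000 budget: the optimum is interior,
# not at either extreme.
fracs = [i / 100 for i in range(1, 100)]
best = min(fracs, key=lambda f: planned_se(5000, f))
print(best)  # → 0.86 under these toy assumptions
```

The exact optimum (and the resulting oracle share of samples) depends on the variance model, so this will not reproduce the table below; the qualitative shape is the point: all-judge and all-oracle splits are both dominated by a mixture.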
Practical warning
Many frameworks under-account for calibration uncertainty. That can produce narrow confidence intervals that look precise but are not decision-safe. In our benchmark paper, naive CIs on uncalibrated scores achieved 0% coverage.
How CJE turns that into a purchase decision
Once you separate the two uncertainty sources, the next purchase becomes legible. Two evals can have the same overall error bar and still imply different next moves depending on which component is dominating.
CJE combines pilot-estimated uncertainty with your cost model, then recommends an oracle/surrogate split that maximizes decision power for your budget.
What should the next dollar buy?
| If this dominates | What it means | Best next purchase |
|---|---|---|
| Coverage uncertainty | You have too little breadth across examples, so average performance is still noisy. | Buy more judge scores on additional examples. |
| Ground-truth uncertainty | You still do not know the judge-to-outcome mapping precisely enough. | Buy more oracle labels on scored examples. |
| Neither dominates | Marginal gains from both purchases are similar, so over-buying either one wastes budget. | Split budget across both label types. |
Concrete example
Suppose you already have judge scores on 10,000 support tickets, but only 50 of those tickets have higher-trust oracle labels. Buying 5,000 more judge scores probably does not fix the main bottleneck. You already have breadth. What you lack is enough oracle coverage to learn how judge scores map to the outcome you actually care about, so the next dollar should buy more oracle labels instead.
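The decision rule behind the table above can be sketched as a marginal-value-per-dollar comparison. This uses the same stylized two-component variance model as before (illustrative constants, not CJE's internal computation), and the 1.5× threshold is an arbitrary choice for the sketch:

```python
def best_next_purchase(n, m, coverage_var, calib_var, c_judge, c_oracle):
    # Marginal reduction in SE^2 per dollar for each purchase type,
    # under a stylized model where coverage variance scales as 1/n
    # (judge-scored examples) and calibration variance as 1/m (oracle labels).
    gain_judge = (coverage_var / n - coverage_var / (n + 1)) / c_judge
    gain_oracle = (calib_var / m - calib_var / (m + 1)) / c_oracle
    if gain_judge > 1.5 * gain_oracle:  # threshold is arbitrary, for illustration
        return "judge scores"
    if gain_oracle > 1.5 * gain_judge:
        return "oracle labels"
    return "split"

# The 10,000-tickets / 50-oracle-labels example from above:
print(best_next_purchase(10_000, 50, 1.0, 1.0, 0.01, 0.16))  # → oracle labels
```

Even though oracle labels cost 16× more here, the 51st oracle label removes far more variance per dollar than the 10,001st judge score.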
Worked scenario
Here is the same illustrative planning scenario across different budgets. This is closer to a cheaper model-oracle workflow than a fully human-review workflow. The point is not the exact numbers. The point is that the planner keeps the split stable and scales both label types together.
| Budget | Total Samples | Oracle Labels | Oracle % | MDE |
|---|---|---|---|---|
| $100 | 2,358 | 477 | 20.2% | 7.2% |
| $500 | 11,790 | 2,388 | 20.3% | 3.2% |
| $1,000 | 23,581 | 4,776 | 20.3% | 2.3% |
| $5,000 | 117,908 | 23,880 | 20.3% | 1.0% |
| $10,000 | 235,816 | 47,761 | 20.3% | 0.7% |
Simulated good-judge scenario: R² = 0.70, surrogate cost = $0.01, oracle cost = $0.16, 80% power, α=0.05. MDE is for pairwise policy comparison. Use the notebook to get your own numbers from simulated or pilot-fitted variance components.
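The MDE column follows from the standard power calculation: for a two-sided test, the minimum detectable effect is the standard error of the pairwise difference scaled by the sum of the two z-scores. This is the textbook formula, not CJE-specific code, and the example SE is a made-up input:

```python
def mde(se_diff, z_alpha=1.9600, z_power=0.8416):
    # Minimum detectable effect at 80% power, alpha = 0.05 (two-sided):
    # MDE = (z_{1-alpha/2} + z_{power}) * SE of the pairwise difference.
    # z defaults correspond to alpha = 0.05 and power = 0.80.
    return (z_alpha + z_power) * se_diff

# e.g. a pairwise SE of 0.36 percentage points gives an MDE of about 1pp:
print(round(mde(0.0036), 3))  # → 0.01
```

This is why MDE in the table shrinks roughly with the square root of budget: doubling the budget scales both label counts together, cutting SE (and hence MDE) by about √2.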
Practical workflow
The planning workflow is simple: start with simulation if you have no pilot data, then refit with pilot labels once you have them.
CJE planning workflow
- Start with rough assumptions: set costs and run `simulate_planning` or `simulate_variance_model`.
- Refine with pilot data: fit variance from real labels with `fit_variance_model`.
- Choose your planning mode: use `plan_evaluation` for a fixed budget or `plan_for_mde` for a target effect size.
Bottom line
Label budgeting is not guesswork. The question is not just how many labels to buy, but which labels to buy next. More judge scores reduce coverage uncertainty. More oracle labels reduce calibration uncertainty. CJE helps you choose the purchase that buys the most confidence per dollar.
Start planning your evaluation
Install CJE and run the planning tools on your own data. The most common mistake is still treating oracle labels as optional. The notebook makes the tradeoff explicit before you spend the budget.
We welcome your feedback
CJE is actively evolving. If you have suggestions, find errors, or want to share your experience using the planning tools on real data, we'd love to hear from you.
