Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives
A formal framework for measuring foundation model value as net advantage over realistic alternatives in dynamic task spaces, with rigorous competition-creation decomposition.
Prerequisites: This extends the IDO technical framework to competitive benchmarking. Assumes familiarity with measure theory, causal inference, and off-policy evaluation.
0. Motivation and problem setup
Standard evaluation measures policy value in isolation. But practitioners need to answer: compared to what?
A foundation model's value is not absolute—it's the net advantage over realistic alternatives (other models, software tools, human workflows), net of switching costs, latency, and learning curves. Moreover, FMs reshape the task landscape: they create new tasks (work that was previously infeasible), compete on existing tasks (work users were already doing), and may eliminate tasks (work that becomes obsolete or deprecated post-FM).
We formalize this as Net IDO Impact (NII), which decomposes into:
- Competition (exploitation): Net gains/losses on the pre-FM task distribution, re-weighted by post-FM prevalence. (Task elimination is captured via density-ratio weights $w(t) < 1$, with $w(t) = 0$ for tasks that disappear entirely.)
- Creation (exploration): Value on genuinely new tasks that have negligible pre-FM density.
Key insight
The Lebesgue decomposition cleanly separates these two sources of value, making them individually estimable via density-ratio estimation and novelty detection.
1. Formal definitions
1.1. Spaces and measures
- Users: Measurable space $(\mathcal{U}, \mathcal{F}_{\mathcal{U}})$ with probability measure $\mu$.
- Tasks: Measurable space $(\mathcal{T}, \mathcal{F}_{\mathcal{T}})$.
- Pre-/Post-FM task measures: $P_0$ and $P_1$ on $\mathcal{T}$.
- Contexts and actions: Spaces $\mathcal{X}$, $\mathcal{A}$. For each $(u, t) \in \mathcal{U} \times \mathcal{T}$, let $\mathcal{D}_{u,t}$ be the context distribution (prompts, histories) when that user does that task.
Note on independence
Nothing requires $u$ and $t$ to be independent. The most general object is a joint distribution $\Pi(du, dt)$ on $\mathcal{U} \times \mathcal{T}$. If you prefer the product form $\mu(du)\,P(dt)$ for simplicity, state that as an assumption; all formulas below hold in general by replacing $\mu(du)\,P(dt)$ with $\Pi(du, dt)$.
1.2. IDO and policy values
- Idealized Deliberation Oracle (IDO). For each $(u, t)$, the oracle utility $Y^*_{u,t} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is measurable.
- Value of a policy $\pi$ for $(u, t)$:
  $$V(\pi; u, t) = \mathbb{E}_{x \sim \mathcal{D}_{u,t},\; a \sim \pi(\cdot \mid x)}\big[\, Y^*_{u,t}(x, a) \,\big]$$
- FM adapted policy and value. Given adaptation (prompting, fine-tuning) by $u$ for $t$, let $\pi^{\mathrm{FM}}_{u,t}$ be the induced policy. Then $V_{\mathrm{FM}}(u, t) = V(\pi^{\mathrm{FM}}_{u,t}; u, t)$.
- Effective alternative value. Let $\mathcal{B}(u, t)$ be feasible alternatives (other models, software, manual workflows). With friction/switching cost $c(b; u, t)$ measured on the same utility scale:
  $$V_{\mathrm{alt}}(u, t) = \max_{b \in \mathcal{B}(u,t)} \big[\, V(\pi^{b}_{u,t}; u, t) - c(b; u, t) \,\big]$$
  Optionally include a cost term $c_{\mathrm{FM}}(u, t)$ inside $V_{\mathrm{FM}}$ to net out access/usage friction of the FM itself.
- Net advantage at $(u, t)$: $\Delta(u, t) = V_{\mathrm{FM}}(u, t) - V_{\mathrm{alt}}(u, t)$.
- Positive/negative parts: $\Delta^{+} = \max(\Delta, 0)$, $\Delta^{-} = \max(-\Delta, 0)$, with $\Delta = \Delta^{+} - \Delta^{-}$.
2. Net IDO Impact (NII)
We define three related quantities to capture gains, losses, and net welfare:
- Wins-only (gains): $\mathrm{NII}^{+} = \int \Delta^{+}(u, t)\, \mu(du)\, P_1(dt)$
- Losses: $\mathrm{NII}^{-} = \int \Delta^{-}(u, t)\, \mu(du)\, P_1(dt)$
- Balanced net welfare: $\mathrm{NII} = \int \Delta(u, t)\, \mu(du)\, P_1(dt) = \mathrm{NII}^{+} - \mathrm{NII}^{-}$
Recommendation: Report the triplet $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$.
$\mathrm{NII}$ alone hides harms. Always report $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$ to give a complete picture of welfare impacts.
3. Dynamic task space decomposition
3.1. Lebesgue decomposition
The post-FM task measure $P_1$ can be uniquely decomposed into a part absolutely continuous w.r.t. $P_0$ and a singular part:
$$P_1(dt) = w(t)\, P_0(dt) + P_1^{\perp}(dt)$$
where $w = dP_1^{\mathrm{ac}}/dP_0$ is the density ratio on the shared support and $P_1^{\perp} \perp P_0$ is the singular (creation) measure. Note: $\int_{\mathcal{T}} w(t)\, P_0(dt) + P_1^{\perp}(\mathcal{T}) = 1$, so the creation mass is $m_{\perp} = P_1^{\perp}(\mathcal{T})$.
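In the discrete case the decomposition is just a partition of the post-FM support. A minimal Python sketch (the task names and probabilities are illustrative, not from any particular dataset):

```python
def lebesgue_decompose(p0, p1):
    """Split a post-FM task pmf p1 into a part absolutely continuous
    w.r.t. the pre-FM pmf p0 (summarized by the density ratio w) and
    a singular (creation) part supported where p0 has no mass."""
    w = {t: p1[t] / p0[t] for t in p1 if p0.get(t, 0) > 0}
    creation = {t: p1[t] for t in p1 if p0.get(t, 0) == 0}
    return w, creation

p0 = {"translate": 0.7, "format": 0.3}                     # pre-FM tasks
p1 = {"translate": 0.56, "format": 0.36, "podcast": 0.08}  # post-FM tasks
w, creation = lebesgue_decompose(p0, p1)
# mass conservation: sum_t w(t) p0(t) + creation mass = 1
total = sum(w[t] * p0[t] for t in w) + sum(creation.values())
```

The `total` check verifies that re-weighted competition mass and creation mass sum to one.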
3.2. Competition-Creation decomposition
Substituting the decomposition of $P_1$ into the definition of $\mathrm{NII}$:
$$\mathrm{NII} = \underbrace{\int \Delta(u, t)\, w(t)\, \mu(du)\, P_0(dt)}_{\text{competition}} \;+\; \underbrace{\int V_{\mathrm{FM}}(u, t)\, \mu(du)\, P_1^{\perp}(dt)}_{\text{creation}}$$
The simplification in the creation integral (replacing $\Delta$ with $V_{\mathrm{FM}}$) corresponds to assuming $V_{\mathrm{alt}}(u, t) \approx 0$ for $t \in \operatorname{supp}(P_1^{\perp})$: genuinely new tasks have no realistic alternative. If nascent alternatives exist for new tasks, use $\Delta$ in the creation term as well.
4. Identification via surrogacy
Under the surrogacy assumptions S1-S2 from the IDO framework, let $\hat{Y} = f(S)$ be the calibrated reward on the IDO scale. If S-admissibility fails (i.e., selection into the logged surrogate $S$ depends on unobserved confounders), recalibration is required; see IDO appendix §3 for transport tests.
For any policy $\pi$, use Direct Method, IPS, or DR estimators to obtain $\hat{V}(\pi; u, t)$ from logs or fresh draws. Then:
$$\hat{\Delta}(u, t) = \hat{V}_{\mathrm{FM}}(u, t) - \hat{V}_{\mathrm{alt}}(u, t)$$
4.1. Components needed for identification
- Alternatives menu. Specify a realistic set $\mathcal{B}(u, t)$ (e.g., another model, toolchain, human workflow) and measure friction $c(b; u, t)$ on the utility scale. For example, $c = \lambda_{\text{time}} \cdot \text{latency} + \lambda_{\text{price}} \cdot \text{cost} + \lambda_{\text{learn}} \cdot \text{onboarding}$, with coefficients $\lambda$ calibrated to convert time/cost into utility.
- Density ratio. Estimate $\hat{w}(t)$ using a discriminator (e.g., logistic regression, KLIEP, Bregman divergence methods). Calibrate so that $\mathbb{E}_{P_0}[\hat{w}] = 1 - \hat{m}_{\perp}$.
- Novelty / creation measure. Flag tasks with negligible pre-FM density as creation tasks ($t \in \operatorname{supp}(P_1^{\perp})$) to estimate the creation mass $\hat{m}_{\perp} = \hat{P}_1^{\perp}(\mathcal{T})$.
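A toy instance of the discriminator route, under the assumption that creation tasks have already been removed from the post-FM sample: fit a logistic classifier to distinguish post-FM from pre-FM tasks, convert its odds into a density ratio, and rescale so that $\mathbb{E}_{P_0}[\hat{w}] = 1 - \hat{m}_{\perp}$. The counts, category names, and helper `fit_logistic` are hypothetical:

```python
import math

def fit_logistic(X, y, lr=0.5, iters=3000):
    """Plain gradient-descent logistic regression (no intercept)."""
    d, n = len(X[0]), len(X)
    w = [0.0] * d
    for _ in range(iters):
        g = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                g[j] += (p - yi) * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, g)]
    return w

# Label 0 = pre-FM sample, label 1 = post-FM sample (creation tasks removed).
pre_counts, post_counts = {"t1": 70, "t2": 30}, {"t1": 56, "t2": 36}
onehot = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
X = [onehot[t] for t, c in pre_counts.items() for _ in range(c)] + \
    [onehot[t] for t, c in post_counts.items() for _ in range(c)]
y = [0.0] * 100 + [1.0] * 92
w_fit = fit_logistic(X, y)

n0, n1, m_creation = 100, 92, 0.08
def w_hat(t):
    p = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w_fit, onehot[t]))))
    ratio = (n0 / n1) * p / (1.0 - p)  # classifier odds -> density ratio
    return ratio * (1.0 - m_creation)  # calibrate: E_P0[w_hat] = 1 - creation mass
```

With categorical tasks the classifier odds converge to empirical count ratios, so `w_hat` recovers the true ratios up to the calibration factor.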
5. Estimation
Data requirements
- Pre- and post-FM task logs with user segments and task descriptors (taxonomy or frozen embeddings)
- Judge scores and a small oracle slice for calibration (see IDO appendix)
- Friction measurements for alternatives (price, latency, onboarding) mapped to utility scale
5.1. Plug-in Monte Carlo estimator
Sampling notation: Let $n_0$ = number of pre-FM samples $(u_i, t_i) \sim \mu \otimes P_0$, $n_1$ = number of post-FM samples $(u_j, t_j) \sim \mu \otimes P_1$, and $n_{\perp}$ = number of post-FM samples flagged as creation tasks (i.e., $t_j \in \operatorname{supp}(\hat{P}_1^{\perp})$).
$$\widehat{\mathrm{NII}} = \frac{1}{n_0} \sum_{i=1}^{n_0} \hat{w}(t_i)\, \hat{\Delta}(u_i, t_i) \;+\; \frac{1}{n_1} \sum_{j:\, t_j \text{ creation}} \hat{\Delta}(u_j, t_j)$$
Use Direct/IPS/DR estimators (preferably DR for stability) to compute each $\hat{\Delta}(u, t)$. Similarly estimate:
$$\widehat{\mathrm{NII}}^{+} = \frac{1}{n_0} \sum_{i=1}^{n_0} \hat{w}(t_i)\, \hat{\Delta}^{+}(u_i, t_i) + \frac{1}{n_1} \sum_{j:\, t_j \text{ creation}} \hat{\Delta}^{+}(u_j, t_j), \qquad \widehat{\mathrm{NII}}^{-} \text{ analogously with } \hat{\Delta}^{-}.$$
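The plug-in estimator can be sketched as follows, assuming already-fitted `w_hat`, `delta_hat`, and `is_creation` callables (all names hypothetical):

```python
def nii_plugin(pre_samples, post_samples, w_hat, delta_hat, is_creation):
    """Plug-in Monte Carlo estimate of (NII+, NII-, NII).

    pre_samples:  (user, task) pairs from pre-FM logs  -> competition term
    post_samples: (user, task) pairs from post-FM logs -> creation term
    """
    plus = minus = 0.0
    for u, t in pre_samples:               # competition: reweight by w_hat
        d = w_hat(t) * delta_hat(u, t)
        plus += max(d, 0.0) / len(pre_samples)
        minus += max(-d, 0.0) / len(pre_samples)
    for u, t in post_samples:              # creation: new-task samples only
        if is_creation(t):
            d = delta_hat(u, t)
            plus += max(d, 0.0) / len(post_samples)
            minus += max(-d, 0.0) / len(post_samples)
    return {"NII+": plus, "NII-": minus, "NII": plus - minus}
```

The sign split is applied after reweighting, which is equivalent to splitting $\hat{\Delta}$ because $\hat{w} \geq 0$.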
5.2. Uncertainty quantification
Inference procedure (extends §5 of IDO appendix)
- Cross-fit all nuisances: Critics $\hat{q}$, logging propensities, and the density-ratio model $\hat{w}$.
- Bootstrap over splits: Resample pre/post task distributions for confidence intervals.
- OUA jackknife: Add oracle-uncertainty-aware variance to capture calibrator uncertainty (see IDO technical appendix §5.3).
- Clustering: If logs exhibit dependence by user or task family, use clustered bootstrap.
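A sketch of the clustered bootstrap from the last step, resampling whole clusters so within-user or within-task-family dependence is preserved (the statistic and cluster structure are placeholders):

```python
import random

def clustered_bootstrap_ci(clusters, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI that resamples whole clusters (e.g., all logs of one
    user or one task family), preserving within-cluster dependence."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        resampled = [rng.choice(clusters) for _ in clusters]
        # flatten the resampled clusters and recompute the statistic
        draws.append(stat([x for cluster in resampled for x in cluster]))
    draws.sort()
    lo = draws[int(n_boot * alpha / 2)]
    hi = draws[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

In practice `stat` would be the full NII pipeline (nuisance refits included, if feasible), not just a mean.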
6. Modeling choices and assumptions
Several design decisions require explicit specification:
1. Joint distribution vs product
If the FM changes who does which tasks (e.g., creation concentrated on specific user segments), use the joint $\Pi(du, dt)$ rather than the product $\mu(du)\,P(dt)$.
2. Accounting for FM friction/costs
Choose one approach to avoid double-counting:
- Option A: Include $c_{\mathrm{FM}}$ inside $V_{\mathrm{FM}}$ (i.e., $V_{\mathrm{FM}}(u, t) = V(\pi^{\mathrm{FM}}_{u,t}; u, t) - c_{\mathrm{FM}}(u, t)$), so $\Delta$ compares net FM value to net alternative value.
- Option B: Treat the FM as one candidate in $\mathcal{B}$ and define $\Delta(u, t) = V_{\mathrm{FM}}(u, t) - \max_{b \in \mathcal{B} \setminus \{\mathrm{FM}\}} \big[ V(\pi^{b}_{u,t}; u, t) - c(b; u, t) \big]$, where the alternatives exclude the FM itself.
3. Wins-only vs net
$\mathrm{NII}^{+}$ is useful for "gross gains," but it hides harms. Always report the triplet $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$.
4. Scale of Y*
Because tasks differ wildly in stakes, be explicit about the cardinal meaning of $Y^*$: e.g., anchored "willingness-to-pay equivalent," "time-saved scaled by value-of-time," or revealed-preference utility calibrated via human judgments under deliberation.
5. Externalities
If tasks create spillovers (positive or negative) on others, encode them in $Y^*$ (social welfare version) or track parallel "private utility" vs "social utility" measures.
6. Creation identification
In practice, $\operatorname{supp}(P_1^{\perp})$ is approximated by "tasks with negligible density under $P_0$." You'll need a pre-registered novelty detection threshold and sensitivity analysis.
Longitudinal note: If non-FM alternatives emerge for a new task later, it moves from creation into competition in subsequent time windows.
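One simple way to operationalize the novelty threshold: score each post-FM task by a kernel-density estimate under the pre-FM sample and flag those below a pre-registered cutoff. A sketch with 1-D "embeddings" (the bandwidth, threshold, and data are hypothetical):

```python
import math

def flag_creation(pre_embs, post_embs, bandwidth=0.5, tau=1e-3):
    """Flag post-FM tasks whose Gaussian-KDE density under the pre-FM
    sample falls below a pre-registered threshold tau."""
    def kde(x):
        n = len(pre_embs)
        return sum(math.exp(-0.5 * ((x - z) / bandwidth) ** 2)
                   for z in pre_embs) / (n * bandwidth * math.sqrt(2 * math.pi))
    flags = [kde(x) < tau for x in post_embs]
    creation_mass = sum(flags) / len(post_embs)
    return flags, creation_mass
```

Re-running this with conservative and liberal `tau` values gives the robustness bounds recommended in the diagnostics section.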
7. Worked example (discrete case)
Consider a simple discrete setting to illustrate the calculation:
Setup
- Users: $\mu(A) = 0.6$, $\mu(B) = 0.4$
- Pre-FM tasks: $t_1$ (translation, $P_0(t_1) = 0.7$), $t_2$ (formatting, $P_0(t_2) = 0.3$)
- Post-FM: $w(t_1) = 0.8$, $w(t_2) = 1.2$, plus new task $t_3$ with creation mass $P_1^{\perp}(t_3) = 0.08$
- Check: $0.8 \times 0.7 + 1.2 \times 0.3 + 0.08 = 0.56 + 0.36 + 0.08 = 1$ ✓
Values (net of friction)
| User | Task | $V_{\mathrm{FM}}$ | $V_{\mathrm{alt}}$ | $\Delta$ |
|---|---|---|---|---|
| A | $t_1$ | 0.85 | 0.80 | +0.05 |
| B | $t_1$ | 0.75 | 0.83 | −0.08 |
| A | $t_2$ | 0.70 | 0.50 | +0.20 |
| B | $t_2$ | 0.62 | 0.55 | +0.07 |
| A | $t_3$ | 0.60 | ~0 | +0.60 |
| B | $t_3$ | 0.72 | ~0 | +0.72 |
Competition weights ($\mu(u) \times w(t) \times P_0(t)$)
For $t_1$:
- User A: $0.6 \times 0.8 \times 0.7 = 0.336$
- User B: $0.4 \times 0.8 \times 0.7 = 0.224$
For $t_2$:
- User A: $0.6 \times 1.2 \times 0.3 = 0.216$
- User B: $0.4 \times 1.2 \times 0.3 = 0.144$
Creation weights ($\mu(u) \times P_1^{\perp}(t_3)$)
- User A: $0.6 \times 0.08 = 0.048$
- User B: $0.4 \times 0.08 = 0.032$
Calculation
Competition gains:
- A, $t_1$: $0.336 \times 0.05 = 0.0168$
- A, $t_2$: $0.216 \times 0.20 = 0.0432$
- B, $t_2$: $0.144 \times 0.07 = 0.01008$
Creation gains:
- A, $t_3$: $0.048 \times 0.60 = 0.0288$
- B, $t_3$: $0.032 \times 0.72 = 0.02304$
Loss (for net calculation):
- B, $t_1$: $0.224 \times 0.08 = 0.01792$
Final results
$\mathrm{NII}^{+} = 0.0168 + 0.0432 + 0.01008 + 0.0288 + 0.02304 = 0.12192$
$\mathrm{NII}^{-} = 0.01792$
$\mathrm{NII} = 0.12192 − 0.01792 = 0.104$
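The arithmetic above can be reproduced in a few lines of Python (same numbers as the setup):

```python
mu = {"A": 0.6, "B": 0.4}       # user weights
p0 = {"t1": 0.7, "t2": 0.3}     # pre-FM task measure
w = {"t1": 0.8, "t2": 1.2}      # density ratio on shared support
creation = {"t3": 0.08}         # singular (creation) mass
delta = {("A", "t1"): 0.05, ("B", "t1"): -0.08,
         ("A", "t2"): 0.20, ("B", "t2"): 0.07,
         ("A", "t3"): 0.60, ("B", "t3"): 0.72}

terms = [mu[u] * w[t] * p0[t] * delta[u, t] for u in mu for t in p0]
terms += [mu[u] * creation[t] * delta[u, t] for u in mu for t in creation]
nii_plus = sum(x for x in terms if x > 0)    # 0.12192
nii_minus = -sum(x for x in terms if x < 0)  # 0.01792
nii = nii_plus - nii_minus                   # 0.104
```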
8. What to report
NII Scoreboard (minimum reporting bundle)
- Triplet: $\mathrm{NII}^{+}$, $\mathrm{NII}^{-}$, $\mathrm{NII}$ (on IDO scale, with 95% CIs)
- Creation mass: $\hat{m}_{\perp} = \hat{P}_1^{\perp}(\mathcal{T})$ and mean $\hat{\Delta}$ on creation tasks
- Competition lift: Mean $\hat{\Delta}$ on shared support, weighted by $\hat{w}(t)$
- Top negative segments: Rank $(u, t)$ cells by their contribution to $\mathrm{NII}^{-}$ and report the smallest friction change that would flip them positive
- Segmented breakdowns: By user group and task family
- Sensitivity: Across novelty thresholds and friction assumptions
9. Diagnostics and guardrails
Beyond the standard IDO diagnostics (transport tests, ESS, coverage), NII requires additional checks:
1. Ratio quality
- Expected value of $\hat{w}$ under $P_0$ should be close to $1 - \hat{m}_{\perp}$
- Check tail behavior: plot the distribution of $\hat{w}$ and compute ESS
- Cross-validation: hold-out calibration check for the discriminator
2. Novelty threshold sensitivity
- Plot creation mass and creation value across a range of thresholds
- Pre-register one primary threshold before seeing post-FM data
- Use dual thresholds (conservative and liberal) for robustness bounds
3. Negative segment analysis
- Identify top $(u, t)$ cells by weighted loss: $\mu(u)\, \hat{w}(t)\, P_0(t)\, \hat{\Delta}^{-}(u, t)$
- For each, compute the friction adjustment to $c(b; u, t)$ that would make $\Delta(u, t) \geq 0$
- Flag segments where small latency/price changes flip the decision
4. Support mismatch detection
- Verify ratio estimators are only applied on the shared support
- Use density plots to visualize the overlap between $P_0$ and $P_1$
- Report the percentage of post-FM mass with negligible pre-FM density
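The ratio-quality checks in item 1 can be scripted directly; a sketch, where `w_vals` holds fitted ratios $\hat{w}(t_i)$ on pre-FM samples and `creation_mass` is the estimated $\hat{P}_1^{\perp}(\mathcal{T})$:

```python
def ratio_diagnostics(w_vals, creation_mass):
    """Mean-of-ratio calibration check plus effective sample size (ESS).
    A low ESS fraction signals heavy-tailed weights / poor overlap."""
    n = len(w_vals)
    mean_w = sum(w_vals) / n
    ess = sum(w_vals) ** 2 / sum(v * v for v in w_vals)
    return {"mean_w": mean_w,               # should be near 1 - creation_mass
            "target": 1.0 - creation_mass,
            "ess_fraction": ess / n}
```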
Common pitfalls
- Support mismatch: Density ratio estimators fail outside the shared support. Always carve out the creation part $P_1^{\perp}$ first via novelty detection.
- Cardinality drift: The task representation must be stable across time (use frozen encoders or a versioned taxonomy).
- Missing propensities: If logged behavior policies are unavailable, prefer doubly-robust estimation for $\hat{\Delta}$.
- Time dynamics: Report NII over time windows as $P_1$ evolves with technology diffusion.
Summary
- Target: Net advantage over realistic alternatives, decomposed into competition (wins/losses on existing tasks) and creation (value on new tasks).
- Mechanism: Lebesgue decomposition cleanly separates these sources of value.
- Identification: Combine the IDO framework (calibrated surrogates for $Y^*$) with density-ratio estimation ($\hat{w}$) and novelty detection ($P_1^{\perp}$).
- Reporting: Always present the triplet $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$, creation mass, and negative segment analysis.
- Governance: Makes "compared to what?" explicit and quantifiable with honest uncertainty.
This extends the IDO measurement framework from single-policy evaluation to competitive benchmarking, turning "is this model better?" into a rigorous, measurable question with estimators, confidence intervals, and falsifiable diagnostics.
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2025nii,
author = {Landesberg, Eddie},
title = {Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives},
year = {2025},
month = {November},
url = {https://cimolabs.com/blog/net-ido-impact},
note = {CIMO Labs Technical Report}
}
Plain text
Landesberg, E. (2025). Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives. CIMO Labs Technical Report. https://cimolabs.com/blog/net-ido-impact
