Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives
A formal framework for measuring foundation model value as net advantage over realistic alternatives in dynamic task spaces, with rigorous competition-creation decomposition.
Prerequisites: This extends the IDO technical framework to competitive benchmarking. Assumes familiarity with measure theory, causal inference, and off-policy evaluation.
0. Motivation and problem setup
Standard evaluation measures policy value in isolation. But practitioners need to answer: compared to what?
A foundation model's value is not absolute—it's the net advantage over realistic alternatives (other models, software tools, human workflows), net of switching costs, latency, and learning curves. Moreover, FMs reshape the task landscape: they create new tasks (work that was previously infeasible), compete on existing tasks (work users were already doing), and may eliminate tasks (work that becomes obsolete or deprecated post-FM).
We formalize this as Net IDO Impact (NII), which decomposes into:
- Competition (exploitation): Net gains/losses on the pre-FM task distribution, re-weighted by post-FM prevalence. (Task elimination is captured via density-ratio weights $w(t) < 1$, with $w(t) = 0$ for tasks that disappear entirely.)
- Creation (exploration): Value on genuinely new tasks that have negligible pre-FM density.
Key insight
The Lebesgue decomposition cleanly separates these two sources of value, making them individually estimable via density-ratio estimation and novelty detection.
1. Formal definitions
1.1. Spaces and measures
- Users: Measurable space $(\mathcal{U}, \mathcal{F}_{\mathcal{U}})$ with probability measure $\mu$.
- Tasks: Measurable space $(\mathcal{T}, \mathcal{F}_{\mathcal{T}})$.
- Pre-/Post-FM task measures: $P_0$ and $P_1$ on $\mathcal{T}$.
- Contexts and actions: Spaces $\mathcal{X}$, $\mathcal{A}$. For each $(u, t) \in \mathcal{U} \times \mathcal{T}$, let $\mathcal{D}_{u,t}$ be the context distribution (prompts, histories) when that user does that task.
Note on independence
Nothing requires $u$ and $t$ to be independent. The most general object is a joint distribution $\Pi(du, dt)$ on $\mathcal{U} \times \mathcal{T}$. If you prefer the product form $\mu(du)\,P(dt)$ for simplicity, state that as an assumption; all formulas below hold in general by replacing $\mu(du)\,P(dt)$ with $\Pi(du, dt)$.
1.2. IDO and policy values
- Idealized Deliberation Oracle (IDO). For each $(u, t)$, the oracle utility $Y^*_{u,t} : \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is measurable.
- Value of a policy $\pi$ for $(u, t)$:
  $$V(\pi; u, t) = \mathbb{E}_{x \sim \mathcal{D}_{u,t},\; a \sim \pi(\cdot \mid x)}\big[\, Y^*_{u,t}(x, a) \,\big]$$
- FM adapted policy and value. Given adaptation (prompting, fine-tuning) by $u$ for $t$, let $\pi^{\mathrm{FM}}_{u,t}$ be the induced policy. Then $V_{\mathrm{FM}}(u, t) = V(\pi^{\mathrm{FM}}_{u,t}; u, t)$.
- Effective alternative value. Let $\mathcal{B}(u, t)$ be feasible alternatives (other models, software, manual workflows). With friction/switching cost $c(b; u, t)$ measured on the same utility scale:
  $$V_{\mathrm{alt}}(u, t) = \max_{b \in \mathcal{B}(u,t)} \big[\, V(\pi^{b}_{u,t}; u, t) - c(b; u, t) \,\big]$$
  Optionally include a cost term $c_{\mathrm{FM}}(u, t)$ inside $V_{\mathrm{FM}}$ to net out access/usage friction of the FM itself.
- Net advantage at $(u, t)$: $\Delta(u, t) = V_{\mathrm{FM}}(u, t) - V_{\mathrm{alt}}(u, t)$.
- Positive/negative parts: $\Delta^{+} = \max(\Delta, 0)$, $\Delta^{-} = \max(-\Delta, 0)$, with $\Delta = \Delta^{+} - \Delta^{-}$.
2. Net IDO Impact (NII)
We define three related quantities to capture gains, losses, and net welfare:
- Wins-only (gains): $\mathrm{NII}^{+} = \int \Delta^{+}(u, t)\, \mu(du)\, P_1(dt)$
- Losses: $\mathrm{NII}^{-} = \int \Delta^{-}(u, t)\, \mu(du)\, P_1(dt)$
- Balanced net welfare: $\mathrm{NII} = \int \Delta(u, t)\, \mu(du)\, P_1(dt) = \mathrm{NII}^{+} - \mathrm{NII}^{-}$
Recommendation: Report the triplet $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$.
$\mathrm{NII}$ alone hides harms. Always report $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$ to give a complete picture of welfare impacts.
3. Dynamic task space decomposition
3.1. Lebesgue decomposition
The post-FM task measure $P_1$ can be uniquely decomposed into a part absolutely continuous w.r.t. $P_0$ and a singular part:
$$P_1(dt) = w(t)\, P_0(dt) + P_1^{\perp}(dt)$$
where $w = dP_1^{\mathrm{ac}}/dP_0$ is the density ratio on the shared support and $P_1^{\perp} \perp P_0$ is the singular (creation) measure. Note: $\int_{\mathcal{T}} w(t)\, P_0(dt) + P_1^{\perp}(\mathcal{T}) = 1$, so the creation mass is $m_{\perp} = P_1^{\perp}(\mathcal{T})$.
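In the discrete case the decomposition is just a partition of the post-FM support. A minimal Python sketch (the task names and probabilities are illustrative, not from any particular dataset):

```python
def lebesgue_decompose(p0, p1):
    """Split a post-FM task pmf p1 into a part absolutely continuous
    w.r.t. the pre-FM pmf p0 (summarized by the density ratio w) and
    a singular (creation) part supported where p0 has no mass."""
    w = {t: p1[t] / p0[t] for t in p1 if p0.get(t, 0) > 0}
    creation = {t: p1[t] for t in p1 if p0.get(t, 0) == 0}
    return w, creation

p0 = {"translate": 0.7, "format": 0.3}                     # pre-FM tasks
p1 = {"translate": 0.56, "format": 0.36, "podcast": 0.08}  # post-FM tasks
w, creation = lebesgue_decompose(p0, p1)
# mass conservation: sum_t w(t) p0(t) + creation mass = 1
total = sum(w[t] * p0[t] for t in w) + sum(creation.values())
```

The `total` check verifies that re-weighted competition mass and creation mass sum to one.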
3.2. Competition-Creation decomposition
Substituting the decomposition of $P_1$ into the definition of $\mathrm{NII}$:
$$\mathrm{NII} = \underbrace{\int \Delta(u, t)\, w(t)\, \mu(du)\, P_0(dt)}_{\text{competition}} \;+\; \underbrace{\int V_{\mathrm{FM}}(u, t)\, \mu(du)\, P_1^{\perp}(dt)}_{\text{creation}}$$
The simplification in the creation integral (replacing $\Delta$ with $V_{\mathrm{FM}}$) corresponds to assuming $V_{\mathrm{alt}}(u, t) \approx 0$ for $t \in \operatorname{supp}(P_1^{\perp})$: genuinely new tasks have no realistic alternative. If nascent alternatives exist for new tasks, use $\Delta$ in the creation term as well.
4. Identification via surrogacy
Under the surrogacy assumptions S1-S2 from the IDO framework, let $\hat{Y} = f(S)$ be the calibrated reward on the IDO scale. If S-admissibility fails (i.e., selection into the logged surrogate $S$ depends on unobserved confounders), recalibration is required; see IDO appendix §3 for transport tests.
For any policy $\pi$, use Direct Method, IPS, or DR estimators to obtain $\hat{V}(\pi; u, t)$ from logs or fresh draws. Then:
$$\hat{\Delta}(u, t) = \hat{V}_{\mathrm{FM}}(u, t) - \hat{V}_{\mathrm{alt}}(u, t)$$
4.1. Components needed for identification
- Alternatives menu. Specify a realistic set $\mathcal{B}(u, t)$ (e.g., another model, toolchain, human workflow) and measure friction $c(b; u, t)$ on the utility scale. For example, $c = \lambda_{\text{time}} \cdot \text{latency} + \lambda_{\text{price}} \cdot \text{cost} + \lambda_{\text{learn}} \cdot \text{onboarding}$, with coefficients $\lambda$ calibrated to convert time/cost into utility.
- Density ratio. Estimate $\hat{w}(t)$ using a discriminator (e.g., logistic regression, KLIEP, Bregman divergence methods). Calibrate so that $\mathbb{E}_{P_0}[\hat{w}] = 1 - \hat{m}_{\perp}$.
- Novelty / creation measure. Flag tasks with negligible pre-FM density as creation tasks ($t \in \operatorname{supp}(P_1^{\perp})$) to estimate the creation mass $\hat{m}_{\perp} = \hat{P}_1^{\perp}(\mathcal{T})$.
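A toy instance of the discriminator route, under the assumption that creation tasks have already been removed from the post-FM sample: fit a logistic classifier to distinguish post-FM from pre-FM tasks, convert its odds into a density ratio, and rescale so that $\mathbb{E}_{P_0}[\hat{w}] = 1 - \hat{m}_{\perp}$. The counts, category names, and helper `fit_logistic` are hypothetical:

```python
import math

def fit_logistic(X, y, lr=0.5, iters=3000):
    """Plain gradient-descent logistic regression (no intercept)."""
    d, n = len(X[0]), len(X)
    w = [0.0] * d
    for _ in range(iters):
        g = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                g[j] += (p - yi) * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, g)]
    return w

# Label 0 = pre-FM sample, label 1 = post-FM sample (creation tasks removed).
pre_counts, post_counts = {"t1": 70, "t2": 30}, {"t1": 56, "t2": 36}
onehot = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
X = [onehot[t] for t, c in pre_counts.items() for _ in range(c)] + \
    [onehot[t] for t, c in post_counts.items() for _ in range(c)]
y = [0.0] * 100 + [1.0] * 92
w_fit = fit_logistic(X, y)

n0, n1, m_creation = 100, 92, 0.08
def w_hat(t):
    p = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w_fit, onehot[t]))))
    ratio = (n0 / n1) * p / (1.0 - p)  # classifier odds -> density ratio
    return ratio * (1.0 - m_creation)  # calibrate: E_P0[w_hat] = 1 - creation mass
```

With categorical tasks the classifier odds converge to empirical count ratios, so `w_hat` recovers the true ratios up to the calibration factor.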
5. Estimation
Data requirements
- Pre- and post-FM task logs with user segments and task descriptors (taxonomy or frozen embeddings)
- Judge scores and a small oracle slice for calibration (see IDO appendix)
- Friction measurements for alternatives (price, latency, onboarding) mapped to utility scale
5.1. Plug-in Monte Carlo estimator
Sampling notation: Let $n_0$ = number of pre-FM samples $(u_i, t_i) \sim \mu \otimes P_0$, $n_1$ = number of post-FM samples $(u_j, t_j) \sim \mu \otimes P_1$, and $n_{\perp}$ = number of post-FM samples flagged as creation tasks (i.e., $t_j \in \operatorname{supp}(\hat{P}_1^{\perp})$).
$$\widehat{\mathrm{NII}} = \frac{1}{n_0} \sum_{i=1}^{n_0} \hat{w}(t_i)\, \hat{\Delta}(u_i, t_i) \;+\; \frac{1}{n_1} \sum_{j:\, t_j \text{ creation}} \hat{\Delta}(u_j, t_j)$$
Use Direct/IPS/DR estimators (preferably DR for stability) to compute each $\hat{\Delta}(u, t)$. Similarly estimate:
$$\widehat{\mathrm{NII}}^{+} = \frac{1}{n_0} \sum_{i=1}^{n_0} \hat{w}(t_i)\, \hat{\Delta}^{+}(u_i, t_i) + \frac{1}{n_1} \sum_{j:\, t_j \text{ creation}} \hat{\Delta}^{+}(u_j, t_j), \qquad \widehat{\mathrm{NII}}^{-} \text{ analogously with } \hat{\Delta}^{-}.$$
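The plug-in estimator can be sketched as follows, assuming already-fitted `w_hat`, `delta_hat`, and `is_creation` callables (all names hypothetical):

```python
def nii_plugin(pre_samples, post_samples, w_hat, delta_hat, is_creation):
    """Plug-in Monte Carlo estimate of (NII+, NII-, NII).

    pre_samples:  (user, task) pairs from pre-FM logs  -> competition term
    post_samples: (user, task) pairs from post-FM logs -> creation term
    """
    plus = minus = 0.0
    for u, t in pre_samples:               # competition: reweight by w_hat
        d = w_hat(t) * delta_hat(u, t)
        plus += max(d, 0.0) / len(pre_samples)
        minus += max(-d, 0.0) / len(pre_samples)
    for u, t in post_samples:              # creation: new-task samples only
        if is_creation(t):
            d = delta_hat(u, t)
            plus += max(d, 0.0) / len(post_samples)
            minus += max(-d, 0.0) / len(post_samples)
    return {"NII+": plus, "NII-": minus, "NII": plus - minus}
```

The sign split is applied after reweighting, which is equivalent to splitting $\hat{\Delta}$ because $\hat{w} \geq 0$.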
5.2. Uncertainty quantification
Inference procedure (extends §5 of IDO appendix)
- Cross-fit all nuisances: Critics $\hat{q}$, logging propensities, and the density-ratio model $\hat{w}$.
- Bootstrap over splits: Resample pre/post task distributions for confidence intervals.
- OUA jackknife: Add oracle-uncertainty-aware variance to capture calibrator uncertainty (see IDO technical appendix §5.3).
- Clustering: If logs exhibit dependence by user or task family, use clustered bootstrap.
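A sketch of the clustered bootstrap from the last step, resampling whole clusters so within-user or within-task-family dependence is preserved (the statistic and cluster structure are placeholders):

```python
import random

def clustered_bootstrap_ci(clusters, stat, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI that resamples whole clusters (e.g., all logs of one
    user or one task family), preserving within-cluster dependence."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_boot):
        resampled = [rng.choice(clusters) for _ in clusters]
        # flatten the resampled clusters and recompute the statistic
        draws.append(stat([x for cluster in resampled for x in cluster]))
    draws.sort()
    lo = draws[int(n_boot * alpha / 2)]
    hi = draws[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

In practice `stat` would be the full NII pipeline (nuisance refits included, if feasible), not just a mean.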
6. Modeling choices and assumptions
Several design decisions require explicit specification:
1. Joint distribution vs product
If the FM changes who does which tasks (e.g., creation concentrated on specific user segments), use the joint $\Pi(du, dt)$ rather than the product $\mu(du)\,P(dt)$.
2. Accounting for FM friction/costs
Choose one approach to avoid double-counting:
- Option A: Include $c_{\mathrm{FM}}$ inside $V_{\mathrm{FM}}$ (i.e., $V_{\mathrm{FM}}(u, t) = V(\pi^{\mathrm{FM}}_{u,t}; u, t) - c_{\mathrm{FM}}(u, t)$), so $\Delta$ compares net FM value to net alternative value.
- Option B: Treat the FM as one candidate in $\mathcal{B}$ and define $\Delta(u, t) = V_{\mathrm{FM}}(u, t) - \max_{b \in \mathcal{B} \setminus \{\mathrm{FM}\}} \big[ V(\pi^{b}_{u,t}; u, t) - c(b; u, t) \big]$, where the alternatives exclude the FM itself.
3. Wins-only vs net
$\mathrm{NII}^{+}$ is useful for "gross gains," but it hides harms. Always report the triplet $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$.
4. Scale of Y*
Because tasks differ wildly in stakes, be explicit about the cardinal meaning of $Y^*$: e.g., anchored "willingness-to-pay equivalent," "time-saved scaled by value-of-time," or revealed-preference utility calibrated via human judgments under deliberation.
5. Externalities
If tasks create spillovers (positive or negative) on others, encode them in $Y^*$ (social welfare version) or track parallel "private utility" vs "social utility" measures.
6. Creation identification
In practice, $\operatorname{supp}(P_1^{\perp})$ is approximated by "tasks with negligible density under $P_0$." You'll need a pre-registered novelty detection threshold and sensitivity analysis.
Longitudinal note: If non-FM alternatives emerge for a new task later, it moves from creation into competition in subsequent time windows.
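One simple way to operationalize the novelty threshold: score each post-FM task by a kernel-density estimate under the pre-FM sample and flag those below a pre-registered cutoff. A sketch with 1-D "embeddings" (the bandwidth, threshold, and data are hypothetical):

```python
import math

def flag_creation(pre_embs, post_embs, bandwidth=0.5, tau=1e-3):
    """Flag post-FM tasks whose Gaussian-KDE density under the pre-FM
    sample falls below a pre-registered threshold tau."""
    def kde(x):
        n = len(pre_embs)
        return sum(math.exp(-0.5 * ((x - z) / bandwidth) ** 2)
                   for z in pre_embs) / (n * bandwidth * math.sqrt(2 * math.pi))
    flags = [kde(x) < tau for x in post_embs]
    creation_mass = sum(flags) / len(post_embs)
    return flags, creation_mass
```

Re-running this with conservative and liberal `tau` values gives the robustness bounds recommended in the diagnostics section.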
7. Worked example (discrete case)
Consider a simple discrete setting to illustrate the calculation:
Setup
- Users: $\mu(A) = 0.6$, $\mu(B) = 0.4$
- Pre-FM tasks: $t_1$ (translation, $P_0(t_1) = 0.7$), $t_2$ (formatting, $P_0(t_2) = 0.3$)
- Post-FM: $w(t_1) = 0.8$, $w(t_2) = 1.2$, plus new task $t_3$ with creation mass $P_1^{\perp}(t_3) = 0.08$
- Check: $0.8 \times 0.7 + 1.2 \times 0.3 + 0.08 = 0.56 + 0.36 + 0.08 = 1$ ✓
Values (net of friction)
| User | Task | $V_{\mathrm{FM}}$ | $V_{\mathrm{alt}}$ | $\Delta$ |
|---|---|---|---|---|
| A | $t_1$ | 0.85 | 0.80 | +0.05 |
| B | $t_1$ | 0.75 | 0.83 | −0.08 |
| A | $t_2$ | 0.70 | 0.50 | +0.20 |
| B | $t_2$ | 0.62 | 0.55 | +0.07 |
| A | $t_3$ | 0.60 | ~0 | +0.60 |
| B | $t_3$ | 0.72 | ~0 | +0.72 |
Competition weights ($\mu(u) \times w(t) \times P_0(t)$)
For $t_1$:
- User A: $0.6 \times 0.8 \times 0.7 = 0.336$
- User B: $0.4 \times 0.8 \times 0.7 = 0.224$
For $t_2$:
- User A: $0.6 \times 1.2 \times 0.3 = 0.216$
- User B: $0.4 \times 1.2 \times 0.3 = 0.144$
Creation weights ($\mu(u) \times P_1^{\perp}(t_3)$)
- User A: $0.6 \times 0.08 = 0.048$
- User B: $0.4 \times 0.08 = 0.032$
Calculation
Competition gains:
- A, $t_1$: $0.336 \times 0.05 = 0.0168$
- A, $t_2$: $0.216 \times 0.20 = 0.0432$
- B, $t_2$: $0.144 \times 0.07 = 0.01008$
Creation gains:
- A, $t_3$: $0.048 \times 0.60 = 0.0288$
- B, $t_3$: $0.032 \times 0.72 = 0.02304$
Loss (for net calculation):
- B, $t_1$: $0.224 \times 0.08 = 0.01792$
Final results
$\mathrm{NII}^{+} = 0.0168 + 0.0432 + 0.01008 + 0.0288 + 0.02304 = 0.12192$
$\mathrm{NII}^{-} = 0.01792$
$\mathrm{NII} = 0.12192 − 0.01792 = 0.104$
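The arithmetic above can be reproduced in a few lines of Python (same numbers as the setup):

```python
mu = {"A": 0.6, "B": 0.4}       # user weights
p0 = {"t1": 0.7, "t2": 0.3}     # pre-FM task measure
w = {"t1": 0.8, "t2": 1.2}      # density ratio on shared support
creation = {"t3": 0.08}         # singular (creation) mass
delta = {("A", "t1"): 0.05, ("B", "t1"): -0.08,
         ("A", "t2"): 0.20, ("B", "t2"): 0.07,
         ("A", "t3"): 0.60, ("B", "t3"): 0.72}

terms = [mu[u] * w[t] * p0[t] * delta[u, t] for u in mu for t in p0]
terms += [mu[u] * creation[t] * delta[u, t] for u in mu for t in creation]
nii_plus = sum(x for x in terms if x > 0)    # 0.12192
nii_minus = -sum(x for x in terms if x < 0)  # 0.01792
nii = nii_plus - nii_minus                   # 0.104
```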
8. What to report
NII Scoreboard (minimum reporting bundle)
- Triplet: $\mathrm{NII}^{+}$, $\mathrm{NII}^{-}$, $\mathrm{NII}$ (on IDO scale, with 95% CIs)
- Creation mass: $\hat{m}_{\perp} = \hat{P}_1^{\perp}(\mathcal{T})$ and mean $\hat{\Delta}$ on creation tasks
- Competition lift: Mean $\hat{\Delta}$ on shared support, weighted by $\hat{w}(t)$
- Top negative segments: Rank $(u, t)$ cells by their contribution to $\mathrm{NII}^{-}$ and report the smallest friction change that would flip them positive
- Segmented breakdowns: By user group and task family
- Sensitivity: Across novelty thresholds and friction assumptions
9. Diagnostics and guardrails
Beyond the standard IDO diagnostics (transport tests, ESS, coverage), NII requires additional checks:
1. Ratio quality
- Expected value of $\hat{w}$ under $P_0$ should be close to $1 - \hat{m}_{\perp}$
- Check tail behavior: plot the distribution of $\hat{w}$ and compute ESS
- Cross-validation: hold-out calibration check for the discriminator
2. Novelty threshold sensitivity
- Plot creation mass and creation value across a range of thresholds
- Pre-register one primary threshold before seeing post-FM data
- Use dual thresholds (conservative and liberal) for robustness bounds
3. Negative segment analysis
- Identify top $(u, t)$ cells by weighted loss: $\mu(u)\, \hat{w}(t)\, P_0(t)\, \hat{\Delta}^{-}(u, t)$
- For each, compute the friction adjustment to $c(b; u, t)$ that would make $\Delta(u, t) \geq 0$
- Flag segments where small latency/price changes flip the decision
4. Support mismatch detection
- Verify ratio estimators are only applied on the shared support
- Use density plots to visualize the overlap between $P_0$ and $P_1$
- Report the percentage of post-FM mass with negligible pre-FM density
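The ratio-quality checks in item 1 can be scripted directly; a sketch, where `w_vals` holds fitted ratios $\hat{w}(t_i)$ on pre-FM samples and `creation_mass` is the estimated $\hat{P}_1^{\perp}(\mathcal{T})$:

```python
def ratio_diagnostics(w_vals, creation_mass):
    """Mean-of-ratio calibration check plus effective sample size (ESS).
    A low ESS fraction signals heavy-tailed weights / poor overlap."""
    n = len(w_vals)
    mean_w = sum(w_vals) / n
    ess = sum(w_vals) ** 2 / sum(v * v for v in w_vals)
    return {"mean_w": mean_w,               # should be near 1 - creation_mass
            "target": 1.0 - creation_mass,
            "ess_fraction": ess / n}
```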
Common pitfalls
- Support mismatch: Density ratio estimators fail outside the shared support. Always carve out the creation part $P_1^{\perp}$ first via novelty detection.
- Cardinality drift: The task representation must be stable across time (use frozen encoders or a versioned taxonomy).
- Missing propensities: If logged behavior policies are unavailable, prefer doubly-robust estimation for $\hat{\Delta}$.
- Time dynamics: Report NII over time windows as $P_1$ evolves with technology diffusion.
Summary
- Target: Net advantage over realistic alternatives, decomposed into competition (wins/losses on existing tasks) and creation (value on new tasks).
- Mechanism: Lebesgue decomposition cleanly separates these sources of value.
- Identification: Combine the IDO framework (calibrated surrogates for $Y^*$) with density-ratio estimation ($\hat{w}$) and novelty detection ($P_1^{\perp}$).
- Reporting: Always present the triplet $(\mathrm{NII}^{+}, \mathrm{NII}^{-}, \mathrm{NII})$, creation mass, and negative segment analysis.
- Governance: Makes "compared to what?" explicit and quantifiable with honest uncertainty.
This extends the IDO measurement framework from single-policy evaluation to competitive benchmarking, turning "is this model better?" into a rigorous, measurable question with estimators, confidence intervals, and falsifiable diagnostics.
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2025nii,
author = {Landesberg, Eddie},
title = {Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives},
year = {2025},
month = {November},
url = {https://cimolabs.com/blog/net-ido-impact},
note = {CIMO Labs Technical Report}
}
Plain text
Landesberg, E. (2025). Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives. CIMO Labs Technical Report. https://cimolabs.com/blog/net-ido-impact
