CIMO Labs

Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives

Eddie Landesberg · 30 min read

A formal framework for measuring foundation model value as net advantage over realistic alternatives in dynamic task spaces, with rigorous competition-creation decomposition.

Prerequisites: This extends the IDO technical framework to competitive benchmarking. Assumes familiarity with measure theory, causal inference, and off-policy evaluation.

0. Motivation and problem setup

Standard evaluation measures policy value $V(\pi) = \mathbb{E}_{X}\,\mathbb{E}_{a \sim \pi(\cdot \mid X)}[Y^*(X,a)]$ in isolation. But practitioners need to answer: compared to what?

A foundation model's value is not absolute: it is the net advantage over realistic alternatives (other models, software tools, human workflows), net of switching costs, latency, and learning curves. Moreover, FMs reshape the task landscape: they create new tasks (work that was previously infeasible), compete on existing tasks (work users were already doing), and may eliminate tasks (work that becomes obsolete or deprecated post-FM).

We formalize this as Net IDO Impact (NII), which decomposes into:

  • Competition (exploitation): Net gains/losses on the pre-FM task distribution, re-weighted by post-FM prevalence. (Task elimination is captured via $w(t) \approx 0$.)
  • Creation (exploration): Value on genuinely new tasks that have negligible pre-FM density.

Key insight

The Lebesgue decomposition $dP_T^{FM} = w(t)\,dP_T^{0}(t) + dP_T^{\perp}(t)$ cleanly separates these two sources of value, making them individually estimable via density-ratio estimation and novelty detection.

1. Formal definitions

1.1. Spaces and measures

  • Users: Measurable space $(\mathcal{U}, \Sigma_{\mathcal{U}})$ with probability measure $P_U$.
  • Tasks: Measurable space $(\mathcal{T}, \Sigma_{\mathcal{T}})$.
  • Pre-/Post-FM task measures: $P_T^{0}$ and $P_T^{FM}$ on $(\mathcal{T}, \Sigma_{\mathcal{T}})$.
  • Contexts and actions: Spaces $\mathcal{X}$, $\mathcal{A}$. For each $(u,t)$, let $P_{X \mid u,t}$ be the context distribution (prompts, histories) when that user does that task.

Note on independence

Nothing requires $U$ and $T$ to be independent. The most general object is a joint $P_{U,T}^{FM}$. If you prefer the product form $P_U \otimes P_T^{FM}$ for simplicity, state that as an assumption; all formulas below hold with $P_{U,T}^{FM}$ by replacing $P_U(u)\,dP_T^{FM}(t)$ with $dP_{U,T}^{FM}(u,t)$.

1.2. IDO and policy values

  • Idealized Deliberation Oracle (IDO). For each $(u,t)$, $Y^*_{u,t}: \mathcal{X} \times \mathcal{A} \to [0,1]$ is measurable.
  • Value of a policy $\pi$ for $(u,t)$:
    $$ V(\pi; u,t) = \mathbb{E}_{x \sim P_{X \mid u,t}} \; \mathbb{E}_{a \sim \pi(\cdot \mid x)} \big[ Y^*_{u,t}(x,a) \big]. \tag{1} $$
  • FM adapted policy and value. Given adaptation (prompting, fine-tuning) by $u$ for $t$, let $\pi_{FM,u,t}$ be the induced policy. Then
    $$ V_{FM}(u,t) = V(\pi_{FM,u,t}; u,t). \tag{2} $$
  • Effective alternative value. Let $\mathcal{Alt}(u,t)$ be the set of feasible alternatives (other models, software, manual workflows). With friction/switching cost $C_s(\pi; u,t) \in [0,1]$ measured on the same utility scale:
    $$ V_{alt}(u,t) = \sup_{\pi \in \mathcal{Alt}(u,t)} \Big\{ V(\pi; u,t) - C_s(\pi; u,t) \Big\}. \tag{3} $$

    Optionally include a cost term $C_{FM}(u,t)$ inside $V_{FM}$ to net out access/usage friction of the FM itself.

  • Net advantage at $(u,t)$:
    $$ \Delta(u,t) = V_{FM}(u,t) - V_{alt}(u,t). \tag{4} $$
  • Positive/negative parts: $(x)_+ = \max(0,x)$, $(x)_- = \max(0,-x)$, with $x = x_+ - x_-$.
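As a concrete illustration of equations (3)-(4), here is a minimal sketch for a discrete alternatives menu. The names (`alt_values`, `frictions`) and the numbers are illustrative, not part of the framework; in practice each value would come from an estimated $\widehat{V}$.

```python
# Sketch: net advantage Δ(u,t) over a discrete alternatives menu (eqs. 3-4).
# alt_values and frictions are illustrative placeholders for estimated
# V(π; u,t) and C_s(π; u,t) per feasible alternative π.

def v_alt(alt_values, frictions):
    """Effective alternative value: sup over alternatives of V - C_s (eq. 3)."""
    return max(v - c for v, c in zip(alt_values, frictions))

def net_advantage(v_fm, alt_values, frictions):
    """Δ(u,t) = V_FM(u,t) - V_alt(u,t) (eq. 4)."""
    return v_fm - v_alt(alt_values, frictions)

# Example: FM scores 0.85; alternatives score (0.80, 0.90) with frictions (0.02, 0.15).
delta = net_advantage(0.85, [0.80, 0.90], [0.02, 0.15])
# V_alt = max(0.78, 0.75) = 0.78, so Δ = 0.07
```

Note how friction can flip the ranking: the stronger raw alternative (0.90) loses to the weaker one (0.80) once switching costs are netted out.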

2. Net IDO Impact (NII)

We define three related quantities to capture gains, losses, and net welfare:

  • Wins-only (gains):
    $$ \mathrm{NII}_{\text{gain}} = \mathbb{E}_{(u,t) \sim P_U \otimes P_T^{FM}} \Big[ \big(\Delta(u,t)\big)_+ \Big]. \tag{5} $$
  • Losses:
    $$ \mathrm{NII}_{\text{loss}} = \mathbb{E}_{(u,t) \sim P_U \otimes P_T^{FM}} \Big[ \big(\Delta(u,t)\big)_- \Big]. \tag{6} $$
  • Balanced net welfare:
    $$ \mathrm{NII}_{\text{net}} = \mathbb{E}_{(u,t) \sim P_U \otimes P_T^{FM}} \Big[ \Delta(u,t) \Big] = \mathrm{NII}_{\text{gain}} - \mathrm{NII}_{\text{loss}}. \tag{7} $$
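Given sampled net advantages $\Delta_i$ drawn from the post-FM $(u,t)$ distribution, the triplet (5)-(7) is a pair of one-sided averages. A minimal sketch:

```python
# Sketch: empirical (gain, loss, net) triplet from sampled net advantages
# Δ_i ~ P_U ⊗ P_T^FM (eqs. 5-7). Input values here are illustrative.

def nii_triplet(deltas):
    n = len(deltas)
    gain = sum(max(d, 0.0) for d in deltas) / n   # E[(Δ)_+]
    loss = sum(max(-d, 0.0) for d in deltas) / n  # E[(Δ)_-]
    return gain, loss, gain - loss                # net = gain - loss

gain, loss, net = nii_triplet([0.05, -0.08, 0.20, 0.07])
# gain = 0.08, loss = 0.02, net = 0.06
```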

Recommendation: Report the triplet

$\mathrm{NII}_{\text{gain}}$ alone hides harms. Always report $(\mathrm{NII}_{\text{gain}}, \mathrm{NII}_{\text{loss}}, \mathrm{NII}_{\text{net}})$ to give a complete picture of welfare impacts.

3. Dynamic task space decomposition

3.1. Lebesgue decomposition

The post-FM task measure can be uniquely decomposed into a part absolutely continuous w.r.t. PT0P_T^{0} and a singular part:

$$ dP_T^{FM}(t) = w(t)\,dP_T^{0}(t) + dP_T^{\perp}(t), \quad P_T^{\perp} \perp P_T^{0}, \tag{8} $$

where $w = \frac{dP_T^{FM}}{dP_T^{0}}$ is the density ratio on the shared support and $P_T^{\perp}$ is the singular (creation) measure. Note:

$$ \int_{\mathcal{T}} w(t)\,dP_T^{0}(t) = 1 - P_T^{\perp}(\mathcal{T}). $$
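In the discrete case the decomposition (8) is just bookkeeping: tasks with positive pre-FM mass get a ratio $w(t)$, and tasks with zero pre-FM mass form the singular part. A sketch (the numbers mirror the worked example in §7):

```python
# Sketch: Lebesgue decomposition of a discrete post-FM task measure against
# a pre-FM measure (eq. 8). Tasks absent pre-FM form the singular (creation) part.

def lebesgue_decompose(p0, p_fm):
    """p0, p_fm: dicts task -> probability. Returns (w, p_perp)."""
    w = {t: p_fm.get(t, 0.0) / p0[t] for t in p0 if p0[t] > 0}
    p_perp = {t: m for t, m in p_fm.items() if p0.get(t, 0.0) == 0.0}
    return w, p_perp

p0 = {"t1": 0.7, "t2": 0.3}
p_fm = {"t1": 0.56, "t2": 0.36, "t3": 0.08}
w, p_perp = lebesgue_decompose(p0, p_fm)
# w = {"t1": 0.8, "t2": 1.2}; creation mass P_perp(T) = 0.08
# Check: sum_t w(t) p0(t) = 0.92 = 1 - 0.08
```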

3.2. Competition-Creation decomposition

Substituting equation (8) into equation (5):

$$ \boxed{ \mathrm{NII}_{\text{gain}} = \underbrace{ \mathbb{E}_{u \sim P_U} \left[ \int_{\mathcal{T}} \big(\Delta(u,t)\big)_+ \, w(t) \, dP_T^{0}(t) \right] }_{\text{Competition (exploitation)}} \;+\; \underbrace{ \mathbb{E}_{u \sim P_U} \left[ \int_{\mathcal{T}} V_{FM}(u,t) \, dP_T^{\perp}(t) \right] }_{\text{Creation (exploration)}} } \tag{9} $$

The simplification in the creation integral (replacing $(\Delta)_+$ with $V_{FM}$) corresponds to assuming $V_{alt}(u,t) \approx 0$ for $t \in \operatorname{supp}(P_T^{\perp})$. If nascent alternatives exist for new tasks, use $V_{FM} - V_{alt}$ in the creation term as well.

4. Identification via surrogacy

Under the surrogacy assumptions S1-S2 from the IDO framework, let $R^{(k)} = f_k(S^{(k)}, X)$ be the calibrated reward on the IDO scale. If S-admissibility fails (i.e., selection into $Y^*$ depends on unobserved confounders), recalibration is required; see IDO appendix §3 for transport tests.

For any policy $\pi$, use Direct Method, IPS, or DR estimators to obtain $\widehat{V}(\pi; u,t)$ from logs or fresh draws. Then:

$$ \widehat{\Delta}(u,t) = \widehat{V}_{FM}(u,t) - \widehat{V}_{alt}(u,t). $$
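For readers who want the DR estimator spelled out, here is a minimal sketch over logged $(x, a, r)$ tuples with a discrete action set. The function names (`pi`, `pi0`, `q_hat`) are illustrative placeholders for the target policy, logging propensities, and a fitted outcome model on the calibrated reward scale.

```python
# Sketch: doubly-robust (DR) value estimate from logged (x, a, r) tuples.
# DR = direct-method term + importance-weighted residual correction.

def dr_value(logs, actions, pi, pi0, q_hat):
    """logs: list of (x, a, r); pi(a, x), pi0(a, x): target/logging propensities;
    q_hat(x, a): outcome model. Returns the DR estimate of V(π)."""
    total = 0.0
    for x, a, r in logs:
        dm = sum(pi(b, x) * q_hat(x, b) for b in actions)  # direct-method term
        iw = pi(a, x) / pi0(a, x)                          # importance weight
        total += dm + iw * (r - q_hat(x, a))               # DR correction
    return total / len(logs)
```

When `q_hat` is exact the correction term has mean zero; when propensities are exact the direct-method bias cancels, which is the usual double-robustness property.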

4.1. Components needed for identification

  • Alternatives menu. Specify a realistic set $\mathcal{Alt}(u,t)$ (e.g., another model, toolchain, human workflow) and measure friction $C_s$ on the $[0,1]$ utility scale. For example, $C_s = \alpha \cdot \text{latency} + \beta \cdot \text{price} + \gamma \cdot \text{onboarding}$, with coefficients calibrated to convert time/cost into utility.
  • Density ratio. Estimate $\hat{w}(t) \approx \frac{dP_T^{FM}}{dP_T^{0}}(t)$ using a discriminator (e.g., logistic regression, KLIEP, Bregman-divergence methods). Calibrate so that $\int \hat{w}\,dP_T^{0} \approx 1 - \widehat{M}_{\text{create}}$.
  • Novelty / creation measure. Flag tasks with negligible pre-FM density as $\widehat{\mathcal{T}}_{\perp}$ to estimate the creation mass $\widehat{M}_{\text{create}} = P_T^{FM}(\widehat{\mathcal{T}}_{\perp})$.
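The discriminator trick behind the density-ratio bullet is Bayes' identity: $w(t) = \frac{p(\text{post} \mid t)}{1 - p(\text{post} \mid t)} \cdot \frac{n_{\text{pre}}}{n_{\text{post}}}$. For a discrete task descriptor the classifier reduces to counts, as in this sketch; with embeddings you would fit a probabilistic classifier (e.g., logistic regression) instead. The sample sizes below are illustrative and chosen to reproduce the worked example in §7.

```python
# Sketch: density-ratio estimation via the discriminator (probabilistic
# classification) trick, in the discrete case where p(post | t) is a count ratio.
from collections import Counter

def density_ratio(pre_tasks, post_tasks):
    """w_hat(t) = [p(post|t) / (1 - p(post|t))] * n_pre / n_post, on pre-FM support."""
    n_pre, n_post = len(pre_tasks), len(post_tasks)
    c_pre, c_post = Counter(pre_tasks), Counter(post_tasks)
    w = {}
    for t in c_pre:  # ratio is only defined on the shared (pre-FM) support
        p_post = c_post[t] / (c_post[t] + c_pre[t])
        w[t] = (p_post / (1.0 - p_post)) * (n_pre / n_post)
    return w

pre = ["t1"] * 7 + ["t2"] * 3
post = ["t1"] * 56 + ["t2"] * 36 + ["t3"] * 8
w_hat = density_ratio(pre, post)
# w_hat ≈ {"t1": 0.8, "t2": 1.2}; "t3" is excluded (no pre-FM support)
```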

5. Estimation

Data requirements

  • Pre- and post-FM task logs with user segments $u$ and task descriptors $t$ (taxonomy or frozen embeddings)
  • Judge scores $S^{(k)}$ and a small oracle slice for $Y^*$ calibration (see IDO appendix)
  • Friction measurements $C_s(\pi; u,t)$ for alternatives (price, latency, onboarding) mapped to the $[0,1]$ utility scale

5.1. Plug-in Monte Carlo estimator

Sampling notation: Let $N_0$ be the number of pre-FM samples $(u_i, t_i) \sim P_U \otimes P_T^{0}$, $N_1$ the number of post-FM samples $(u_j, t_j) \sim P_U \otimes P_T^{FM}$, and $N_\perp$ the number of post-FM samples flagged as creation tasks (i.e., $t_j \in \widehat{\mathcal{T}}_\perp$).

$$ \widehat{\mathrm{NII}}_{\text{gain}} = \frac{1}{N_0} \sum_{i=1}^{N_0} \big(\widehat{\Delta}(u_i, t_i)\big)_+ \, \hat{w}(t_i) \;+\; \widehat{M}_{\text{create}} \cdot \left(\frac{1}{N_\perp}\sum_{j: t_j \in \widehat{\mathcal{T}}_\perp} \widehat{V}_{FM}(u_j, t_j)\right), \quad \widehat{M}_{\text{create}} = \frac{N_\perp}{N_1}. \tag{10} $$

Use Direct/IPS/DR estimators (preferably DR for stability) to compute each $\widehat{V}$. Similarly estimate:

$$ \widehat{\mathrm{NII}}_{\text{loss}} = \frac{1}{N_0} \sum_{i=1}^{N_0} \big(\widehat{\Delta}(u_i, t_i)\big)_- \, \hat{w}(t_i), \quad \widehat{\mathrm{NII}}_{\text{net}} = \widehat{\mathrm{NII}}_{\text{gain}} - \widehat{\mathrm{NII}}_{\text{loss}}. $$
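Estimator (10) can be sketched directly, with the fitted nuisances passed in as callables. The names (`delta_hat`, `w_hat`, `v_fm_hat`, `is_novel`) are assumed to come from the upstream steps (calibrated DR values, discriminator, novelty detector) and are illustrative.

```python
# Sketch of the plug-in estimator (10): pre-FM samples carry the competition
# term; post-FM samples flagged as novel carry the creation term.

def nii_gain_hat(pre, post, delta_hat, w_hat, v_fm_hat, is_novel):
    """pre, post: lists of (u, t) pairs sampled pre-/post-FM."""
    comp = sum(max(delta_hat(u, t), 0.0) * w_hat(t) for u, t in pre) / len(pre)
    novel = [(u, t) for u, t in post if is_novel(t)]
    m_create = len(novel) / len(post)  # N_perp / N_1
    create = 0.0
    if novel:
        create = m_create * sum(v_fm_hat(u, t) for u, t in novel) / len(novel)
    return comp + create
```

The loss estimator is identical with $(\widehat{\Delta})_-$ in place of $(\widehat{\Delta})_+$ and no creation term.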

5.2. Uncertainty quantification

Inference procedure (extends §5 of IDO appendix)

  • Cross-fit all nuisances: Critics $f_k$, propensities $\pi_0$, and the density-ratio model $\hat{w}$.
  • Bootstrap over splits: Resample pre/post task distributions for confidence intervals.
  • OUA jackknife: Add oracle-uncertainty-aware variance to capture calibrator uncertainty (see IDO technical appendix §5.3).
  • Clustering: If logs exhibit dependence by user or task family, use clustered bootstrap.
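The clustering bullet can be sketched as a cluster bootstrap that resamples whole users, preserving within-user dependence. The statistic and seed are illustrative; in practice the statistic would be the full NII triplet recomputed per resample.

```python
# Sketch: clustered (block) bootstrap over users. Resampling entire user
# clusters keeps within-user dependence intact in each bootstrap replicate.
import random

def clustered_bootstrap(clusters, statistic, n_boot=1000, seed=0):
    """clusters: dict user -> list of per-(u,t) Δ values. Returns bootstrap draws."""
    rng = random.Random(seed)
    users = list(clusters)
    draws = []
    for _ in range(n_boot):
        resampled = [d for u in rng.choices(users, k=len(users))
                     for d in clusters[u]]
        draws.append(statistic(resampled))
    return draws

# A percentile 95% CI is then, e.g.:
# s = sorted(draws); lo, hi = s[int(0.025 * len(s))], s[int(0.975 * len(s))]
```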

6. Modeling choices and assumptions

Several design decisions require explicit specification:

1. Joint distribution vs product

If the FM changes who does which tasks (e.g., creation concentrated in specific user segments), use the joint $P_{U,T}^{FM}$ rather than the product $P_U \otimes P_T^{FM}$.

2. Accounting for FM friction/costs

Choose one approach to avoid double-counting:

  • Option A: Include $C_{FM}(u,t)$ inside $V_{FM}$ (i.e., $V_{FM} = V(\pi_{FM}) - C_{FM}$), so $\Delta$ compares net FM value to net alternative value.
  • Option B: Treat the FM as one candidate in $\mathcal{Alt}$ and define $\Delta = V_{FM} - \sup_{\pi \in \mathcal{Alt}_{\text{non-FM}}} \{V(\pi) - C_s(\pi)\}$, where alternatives exclude the FM itself.

3. Wins-only vs net

$\mathrm{NII}_{\text{gain}}$ is useful for "gross gains," but it hides harms. Always report the triplet $(\mathrm{NII}_{\text{gain}}, \mathrm{NII}_{\text{loss}}, \mathrm{NII}_{\text{net}})$.

4. Scale of Y*

Because tasks differ wildly in stakes, be explicit about the cardinal meaning of $[0,1]$: e.g., an anchored "willingness-to-pay equivalent," "time saved scaled by value of time," or revealed-preference utility calibrated via human judgments under deliberation.

5. Externalities

If tasks create spillovers (positive or negative) on others, encode them in $Y^*_{u,t}$ (social-welfare version) or track parallel "private utility" vs. "social utility" measures.

6. Creation identification

In practice, $P_T^{\perp}$ is approximated by "tasks with negligible density under $P_T^{0}$." You'll need a pre-registered novelty-detection threshold and a sensitivity analysis.

Longitudinal note: If non-FM alternatives emerge for a new task later, it moves from creation into competition in subsequent time windows.

7. Worked example (discrete case)

Consider a simple discrete setting to illustrate the calculation:

Setup

  • Users: $P_U(A) = 0.6, \; P_U(B) = 0.4$
  • Pre-FM tasks: $t_1$ (translation, mass 0.7), $t_2$ (formatting, mass 0.3)
  • Post-FM: $w(t_1) = 0.8, \; w(t_2) = 1.2$, plus a new task $t_3$ with creation mass $P_T^{\perp}(\{t_3\}) = 0.08$
  • Check: $0.7 \times 0.8 + 0.3 \times 1.2 = 0.92$; $0.92 + 0.08 = 1.0$

Values (net of friction)

| User | Task | $V_{FM}$ | $V_{alt}$ | $\Delta$ |
|------|------|----------|-----------|----------|
| A | $t_1$ | 0.85 | 0.80 | +0.05 |
| B | $t_1$ | 0.75 | 0.83 | −0.08 |
| A | $t_2$ | 0.70 | 0.50 | +0.20 |
| B | $t_2$ | 0.62 | 0.55 | +0.07 |
| A | $t_3$ | 0.60 | ~0 | +0.60 |
| B | $t_3$ | 0.72 | ~0 | +0.72 |

Competition weights

For $t_1$: $0.7 \times 0.8 = 0.56$

  • User A: $0.6 \times 0.56 = 0.336$
  • User B: $0.4 \times 0.56 = 0.224$

For $t_2$: $0.3 \times 1.2 = 0.36$

  • User A: $0.6 \times 0.36 = 0.216$
  • User B: $0.4 \times 0.36 = 0.144$

Creation weights

$P_T^{\perp}(\{t_3\}) = 0.08$

  • User A: $0.6 \times 0.08 = 0.048$
  • User B: $0.4 \times 0.08 = 0.032$

Calculation

Competition gains:

  • A, $t_1$: $0.336 \times 0.05 = 0.0168$
  • A, $t_2$: $0.216 \times 0.20 = 0.0432$
  • B, $t_2$: $0.144 \times 0.07 = 0.01008$

Creation gains:

  • A, $t_3$: $0.048 \times 0.60 = 0.0288$
  • B, $t_3$: $0.032 \times 0.72 = 0.02304$

Loss (for net calculation):

  • B, $t_1$: $0.224 \times 0.08 = 0.01792$

Final results

$\mathrm{NII}_{\text{gain}} = 0.0168 + 0.0432 + 0.01008 + 0.0288 + 0.02304 = 0.12192$

$\mathrm{NII}_{\text{loss}} = 0.01792$

$\mathrm{NII}_{\text{net}} = 0.12192 - 0.01792 = 0.104$
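The whole worked example fits in a few lines of code, which doubles as a sanity check on the arithmetic above:

```python
# Reproduces the §7 worked example: population weight p_U(u) · mass(t) times Δ,
# split by sign. For t3 (creation), Δ = V_FM since V_alt ≈ 0.
p_u = {"A": 0.6, "B": 0.4}
mass = {"t1": 0.7 * 0.8, "t2": 0.3 * 1.2, "t3": 0.08}  # w(t)·p0(t), plus creation
delta = {("A", "t1"): 0.05, ("B", "t1"): -0.08, ("A", "t2"): 0.20,
         ("B", "t2"): 0.07, ("A", "t3"): 0.60, ("B", "t3"): 0.72}

gain = sum(p_u[u] * mass[t] * max(d, 0.0) for (u, t), d in delta.items())
loss = sum(p_u[u] * mass[t] * max(-d, 0.0) for (u, t), d in delta.items())
print(round(gain, 5), round(loss, 5), round(gain - loss, 5))
# → 0.12192 0.01792 0.104
```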

8. What to report

NII Scoreboard (minimum reporting bundle)

  • Triplet: $\mathrm{NII}_{\text{gain}}$, $\mathrm{NII}_{\text{loss}}$, $\mathrm{NII}_{\text{net}}$ (on the IDO scale, with 95% CIs)
  • Creation mass: $M_{\text{create}} = P_T^{\perp}(\mathcal{T})$ and mean $V_{FM}$ on creation tasks
  • Competition lift: Mean $(\Delta)_+$ on the shared support, weighted by $w(t)$
  • Top negative segments: Rank $(u,t)$ cells by $(\Delta)_- \cdot w(t)$ and report the smallest friction change that would flip them positive
  • Segmented breakdowns: By user group and task family
  • Sensitivity: Across novelty thresholds and friction assumptions

9. Diagnostics and guardrails

Beyond the standard IDO diagnostics (transport tests, ESS, coverage), NII requires additional checks:

1. Ratio quality

  • The expectation of $\hat{w}$ under $P_T^{0}$ should be close to $1 - \widehat{M}_{\text{create}}$
  • Check tail behavior: plot the distribution of $\hat{w}$ and compute ESS
  • Cross-validation: hold-out calibration check for the discriminator
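The ESS check in the second bullet is the standard Kish effective sample size of the fitted ratios over the pre-FM sample; an ESS far below $N_0$ signals heavy tails in $\hat{w}$. A minimal sketch:

```python
# Sketch: Kish effective sample size of the fitted density ratios w_hat(t_i)
# over the pre-FM sample. ESS << n flags heavy-tailed weights.

def effective_sample_size(weights):
    """ESS = (Σw)^2 / Σw^2; equals n for uniform weights, approaches 1 when
    a single weight dominates."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)
```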

2. Novelty threshold sensitivity

  • Plot creation mass and creation value across a range of thresholds
  • Pre-register one primary threshold before seeing post-FM data
  • Use dual thresholds (conservative and liberal) for robustness bounds

3. Negative segment analysis

  • Identify the top $(u,t)$ cells by weighted loss $(\Delta)_- \cdot w(t)$
  • For each, compute the friction adjustment that would make $\Delta \geq 0$
  • Flag segments where small latency/price changes flip the decision

4. Support mismatch detection

  • Verify ratio estimators are only applied on shared support
  • Use density plots to visualize the overlap between $P_T^{0}$ and $P_T^{FM}$
  • Report percentage of post-FM mass with negligible pre-FM density

Common pitfalls

  • Support mismatch: Density-ratio estimators fail outside the shared support. Always carve out $P_T^{\perp}$ first via novelty detection.
  • Cardinality drift: The task representation $t$ must be stable across time (use frozen encoders or a versioned taxonomy).
  • Missing propensities: If logged behavior policies are unavailable, prefer doubly-robust estimation for $V$.
  • Time dynamics: Report NII over time windows as $P_T^{FM}$ evolves with technology diffusion.

Summary

  • Target: Net advantage over realistic alternatives, decomposed into competition (wins/losses on existing tasks) and creation (value on new tasks).
  • Mechanism: The Lebesgue decomposition $dP_T^{FM} = w(t)\,dP_T^{0} + dP_T^{\perp}$ cleanly separates these sources of value.
  • Identification: Combine the IDO framework (calibrated surrogates for $V$) with density-ratio estimation ($w$) and novelty detection ($P_T^{\perp}$).
  • Reporting: Always present the triplet $(\mathrm{NII}_{\text{gain}}, \mathrm{NII}_{\text{loss}}, \mathrm{NII}_{\text{net}})$, creation mass, and a negative-segment analysis.
  • Governance: Makes "compared to what?" explicit and quantifiable with honest uncertainty.

This extends the IDO measurement framework from single-policy evaluation to competitive benchmarking, turning "is this model better?" into a rigorous, measurable question with estimators, confidence intervals, and falsifiable diagnostics.

Citation

If you use this work, please cite:

BibTeX

@misc{landesberg2025nii,
  author = {Landesberg, Eddie},
  title = {Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives},
  year = {2025},
  month = {November},
  url = {https://cimolabs.com/blog/net-ido-impact},
  note = {CIMO Labs Technical Report}
}

Plain Text

Landesberg, E. (2025). Net IDO Impact: Evaluating Foundation Models Against Realistic Alternatives. CIMO Labs Technical Report. https://cimolabs.com/blog/net-ido-impact
