About CIMO Labs
We build tools that bring causal and statistical rigor to LLM evaluation.
What is CIMO?
The quality of any decision depends on causal information—your knowledge about what would happen in parallel universes where you chose differently (see: "Want to make good business decisions? Learn causality").
Choosing an AI model, a prompt, a retrieval pipeline—these are all predictions about potential outcomes. Which parallel universe gives you the best experience? The answer depends entirely on the causal information you have access to.
CIMO (Causal Information Manifold Optimization) is our framework for the economics of information: What's the cost and value of different information sources? Where is ROI highest? How do you frame problems to maximize decision performance by optimizing over the manifold of causal information that determines outcome quality?
The Problem
There's an interesting tension in how AI systems get evaluated today.
For traditional product changes—ranking algorithms, pricing models, core features—teams at scale use A/B tests. Randomization, statistical significance, confidence intervals. The infrastructure is mature and the discipline is established.
But A/B tests are slow. Weeks of production traffic. Real users exposed to potentially worse variants. Infrastructure overhead to serve multiple models in parallel. For teams iterating on prompts or evaluating new models, that cycle time is prohibitive.
So teams use offline evaluations: generate responses, score with an LLM judge, compare averages. Fast, cheap, enables rapid iteration. The challenge is calibration.
A judge score improves from 7.2 to 7.8. What does that mean for the KPI you actually care about—conversion, retention, user satisfaction? Without calibration to ground truth, it's unclear. The judge is measuring something, but the mapping to production outcomes isn't guaranteed.
This creates a tradeoff: rigorous but slow (A/B tests) versus fast but uncertain (uncalibrated offline evals).
Our Solution
We're building infrastructure for offline evaluations with the statistical rigor of A/B tests. That requires making causal assumptions explicit, calibrating judge scores to outcomes that matter, and quantifying uncertainty honestly.
Our tools turn unreliable evaluation signals into audit-ready estimates with confidence intervals you can defend to stakeholders, regulators, and yourself.
Research Focus
We work at the intersection of causal inference, off-policy evaluation, and LLM systems. Our current focus areas:
Surrogate Metrics & Calibrated Evaluation
Treating fast metrics (judge scores, clicks, engagement) as calibrated surrogates for an Idealized Deliberation Oracle—what unlimited human deliberation would conclude. This uses mean-preserving transformations and oracle-uncertainty-aware confidence intervals.
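For intuition, here is a minimal sketch of the calibration step, not CJE's actual interface: fit a monotone map from judge scores to oracle outcomes on a small labeled slice, then apply it to every score. The data sizes, variable names, and scikit-learn usage are illustrative assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Illustrative data: judge scores for every prompt, oracle labels for a small slice.
judge_scores = rng.uniform(0, 10, size=5_000)                    # fast, cheap metric
labeled_idx = rng.choice(5_000, size=250, replace=False)         # slice sent for oracle labels
oracle_labels = rng.binomial(1, judge_scores[labeled_idx] / 10)  # stand-in for deliberated outcomes

# Fit a monotone map from judge score -> expected oracle outcome on the labeled slice.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(judge_scores[labeled_idx], oracle_labels)

# Apply the calibrated transform to every judge score.
calibrated = calibrator.predict(judge_scores)

# The calibrated mean lives on the oracle's outcome scale, not the judge's 0-10 scale.
print(f"raw judge mean:  {judge_scores.mean():.2f}")
print(f"calibrated mean: {calibrated.mean():.3f}")
```

The monotone fit keeps the judge's ordering while moving scores onto the oracle's scale; the uncertainty contributed by the small oracle slice is what the oracle-uncertainty-aware intervals above account for.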
Off-Policy Estimation for LLMs
Adapting importance sampling and doubly robust methods to handle distributional shifts, heavy-tailed weights, and partial observability in language model deployments.
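A rough sketch of the estimator family involved, with synthetic data and hypothetical names rather than the library's API: a doubly robust estimate combines an outcome model with self-normalized importance weights, and clipping the log-ratio is one blunt way to cope with heavy-tailed weights.

```python
import numpy as np

def dr_estimate(rewards, logp_target, logp_logging, q_logged, q_target):
    """Doubly robust off-policy value estimate with self-normalized (Hajek) weights.

    rewards:      calibrated rewards observed under the logging policy
    logp_target:  log-prob of each logged response under the target policy
    logp_logging: log-prob of each logged response under the logging policy
    q_logged:     outcome-model prediction for the logged response
    q_target:     outcome-model value under the target policy for the same context
    """
    # Clipping the log-ratio tames heavy-tailed weights (crude but common).
    w = np.exp(np.clip(logp_target - logp_logging, -20.0, 20.0))
    w = w / w.mean()  # self-normalize so the weights average to 1
    # Direct-method baseline plus an importance-weighted residual correction.
    return q_target.mean() + (w * (rewards - q_logged)).mean()

# Synthetic usage: a mild shift between logging and target policies.
rng = np.random.default_rng(1)
n = 2_000
rewards = rng.uniform(0, 1, n)
logp_logging = rng.normal(-30, 3, n)
logp_target = logp_logging + rng.normal(0, 0.5, n)
q_logged = np.clip(rewards + rng.normal(0, 0.1, n), 0, 1)  # imperfect outcome model
q_target = q_logged                                         # no modeled lift, for simplicity
print(f"estimated target-policy value: "
      f"{dr_estimate(rewards, logp_target, logp_logging, q_logged, q_target):.3f}")
```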
Diagnostic Infrastructure
Building operator-facing tools that surface coverage issues, overlap problems, and calibration failures—with concrete remediation strategies.
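As one example of the kind of overlap check such tooling surfaces, the effective sample size (ESS) of the importance weights collapses when the target policy concentrates where the logging policy rarely goes. The function below is a hypothetical illustration, not the shipped tooling.

```python
import numpy as np

def ess_fraction(weights):
    """Effective sample size of importance weights, as a fraction of n.

    Near 1.0: good overlap between logging and target policies.
    Near 0.0: a handful of samples dominate, and any CI built on top is suspect.
    """
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))

rng = np.random.default_rng(2)
mild = rng.lognormal(mean=0.0, sigma=0.3, size=5_000)    # weights under a small policy shift
severe = rng.lognormal(mean=0.0, sigma=3.0, size=5_000)  # weights under a large policy shift
print(f"ESS fraction, mild shift:   {ess_fraction(mild):.2f}")
print(f"ESS fraction, severe shift: {ess_fraction(severe):.3f}")
```

A diagnostic-first tool would flag the second case and point to remediation, such as collecting logging data with better coverage before trusting the estimate.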
Approach
We believe evaluation infrastructure should be:
- Statistically principled. Every estimate comes with a confidence interval that accounts for all sources of uncertainty.
- Causally interpretable. We estimate what would happen if you deployed the policy, not just observational correlations.
- Diagnostic-first. Tools should tell you when they're unreliable and how to fix it, not just silently fail.
- Practitioner-focused. Built for teams shipping production systems, not just researchers publishing papers.
Open Source
Our core tools are open source. We believe the LLM ecosystem needs shared infrastructure for rigorous evaluation, not proprietary black boxes.
CJE (Causal Judge Evaluation) is our flagship library for turning judge scores into statistically valid estimates. MIT licensed, actively maintained, with comprehensive documentation.
Founder

Eddie Landesberg is a research scientist and software engineer focused on causal evaluation for AI systems.
Previously at Stitch Fix, Eddie built, from scratch, a causally rigorous advertising-spend optimization system that managed $150M/year in spend; randomized experiments indicated ~$40M/year in incremental efficiency gains. He also improved the deep-learning personalization engine powering the company's core product and authored "Want to make good business decisions? Learn causality", one of the most cited posts on Stitch Fix's technical blog, featured in university curricula and cited in the Academy of Management Review.
At Salesforce, he was the first data science hire in the marketing org and led the first machine learning models deployed end-to-end to salesforce.com. As co-founder and CEO of Fondu, he built consumer-facing long-term memory for LLMs (thousands of users, ~2 hours/WAU, ~40% D30 bounded retention) and was featured in a16z's Building the AI Brain.
Eddie holds a BA in Economics from Georgetown University. He has guest lectured at Stanford MS&E on the economics of consumer data.
CIMO Labs is backed by former C-level tech leaders and researchers from Stitch Fix, Netflix, Meta, and AllianceBernstein.
Contact
We work with teams facing challenging evaluation problems—high-stakes deployments, limited labeled data, distributional shift, or regulatory requirements.
