Your AI Is Optimizing the Wrong Thing
Your prompt says “be helpful” but your rubric measures “response length.” Here’s how to align your AI system’s generation and evaluation to the same welfare target (Y*)—making optimization actually work and evaluation actually interpretable. Zero math, copy-paste practical.
TL;DR
- The problem: Your prompt says "maximize user welfare," but your rubric scores "response length." The AI optimizes for what you measure, not what you asked for. Misalignment between generation and evaluation = Goodhart's Law at scale.
- The fix: Y*-alignment means both your prompt and your rubric target the same welfare construct (Y*). Use a Standard Deliberation Protocol (SDP) to operationalize Y*—a fixed, versioned procedure that both generation and evaluation follow. Copy-paste the templates below.
- What you get: Judge-pool invariance (different judges converge), optimization compatibility (no gaming), transportable off-policy evaluation, and causal interpretability.
When your prompt and your rubric tell opposite stories
Here’s a scenario that plays out daily in AI teams: You deploy a customer service agent. Your prompt says:
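Something like: "You are a customer support agent. Be helpful and empathetic, resolve the customer's issue fully, and maximize customer satisfaction."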
Sounds good. But then your evaluation rubric says:
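Something like: "Score 1 if the response is under 100 words, stays on script, and does not offer a refund; score 0 otherwise."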
See the problem? You asked the AI to optimize customer satisfaction, but you’re measuring brevity and refund avoidance. The AI will learn to optimize what you measure, not what you asked for. You get terse, unhelpful responses that technically comply with policy but leave customers frustrated.
Concrete Failure Modes
Misaligned optimization: Your prompt says “be helpful,” but your rubric measures “response time.” The AI learns to give instant, superficial answers that score well but don’t actually help users.
Goodhart’s Law in action: “When a measure becomes a target, it ceases to be a good measure.” The AI games the metric you chose, not the outcome you care about.
Evaluation breakdown: Your off-policy estimates become uninterpretable because different judge pools prioritize different dimensions (some value empathy, others value brevity), making welfare comparisons meaningless.
This isn’t just a prompt engineering problem—it’s a causal alignment problem. If your generation target and your evaluation target are measuring different things, you get:
- Optimization failures: The AI optimizes for the wrong objective
- Gaming: The AI learns to exploit the gap between what you asked for and what you measure
- Uninterpretable evaluations: Scores don’t reflect actual welfare, so A/B tests are meaningless
- Non-transportability: Results don’t generalize across contexts or judge pools
The fix? Y*-alignment: Both generation and evaluation target the same welfare construct.
What is Y*?
Y* (pronounced “Y-star”) is your North Star—the true welfare outcome you care about, normalized to a 0-1 scale.
More formally, Y* is shorthand for the Idealized Deliberation Oracle (IDO): the judgment you’d make if you could deliberate perfectly with complete information, reflective consistency, and impartial aggregation.
What does “perfect deliberation” mean? Three ingredients:
- Complete information: All relevant facts, consequences, and context
- Reflective consistency: Time to think through trade-offs and implications
- Impartial aggregation: Balanced consideration of all stakeholder perspectives
Why Y* Matters
One target, two uses: Y* is what you tell the AI to optimize and what your judges evaluate. Same scale, same construct, same welfare definition.
Causal interpretability: When generation and evaluation align to Y*, scores actually mean something. A 0.7 on the Y* scale represents the same welfare level whether it came from Policy A or Policy B, from Judge Pool 1 or Judge Pool 2.
Audit-ready: Y* gives you a principled answer to “what are you optimizing for?” and “how do you know this system is better?”—because you can point to the same deliberative standard used in both generation and evaluation.
What does Y* look like in practice? Here are three common domain archetypes:
These are illustrative. Your Y* definition should be domain-specific, and pinning it down typically requires deep domain knowledge.
Product — optimize lifetime value, not clicks
Common misalignment: Prompt says “help users make informed decisions,” but rubric scores “click-through rate” or “sign-ups.”
Y*-aligned target: Product leader deliberation on long-term customer value: balances conversion with retention and satisfaction, flags when persuasive patterns may backfire, optimizes for customers who stay and refer others. What you’d recommend if measured on 12-month LTV, not next-quarter sign-ups.
Safety — optimize protection, not refusal rates
Common misalignment: Prompt says “provide safe responses,” but rubric scores “policy compliance checkmarks” or “refusal percentage.”
Y*-aligned target: Security expert deliberation on actual risk reduction: distinguishes real threats from false positives, balances safety with utility, applies context-appropriate mitigations. What a careful security team would approve after reviewing threat models and user impact, not what maximizes test-suite pass rates.
Health — optimize outcomes, not satisfaction
Common misalignment: Prompt says “provide safe medical guidance,” but rubric scores “patient satisfaction” or “reassurance.”
Y*-aligned target: Board-certified specialist deliberation on safety, efficacy, and long-term health: balances symptom relief with evidence-based outcomes, flags when more information is needed, escalates appropriately. What a careful physician would recommend after reviewing all available evidence, not what maximizes immediate satisfaction scores.
Notice the pattern: Y* is always about the outcome you actually care about, not surface signals that correlate poorly with that outcome.
What is the Idealized Deliberation Oracle (IDO)?
The Idealized Deliberation Oracle (IDO) is the formal name for Y*—the judgment you would make if you could deliberate perfectly with complete information, unlimited time for reflection, and impartial consideration of all stakeholders.
Think of it as the "ground truth" for welfare judgments. In reality, you can never access the IDO directly (perfect deliberation is infinitely expensive), but it serves as the theoretical target that both your AI system and your evaluation should aim toward.
Why call it an "oracle"? In computer science and statistics, an oracle is an idealized black box that gives you the right answer. The IDO is the oracle for welfare: if you could query it, it would tell you the true welfare value of any outcome. In practice, you approximate it using Standard Deliberation Protocols (SDPs) with human experts or careful LLM judges.
Relationship to Y*: Y* is shorthand for IDO(task, policy). It's the oracle's judgment, normalized to a 0-1 scale, for a specific task and policy combination. When we say "target Y*," we mean "aim for what the IDO would conclude about welfare."
What is Y*-alignment?
Y*-alignment means both your generation prompt and your evaluation rubric target the same welfare construct (Y*).
When a system is Y*-aligned:
- Your AI prompt says: "Optimize for Y* (e.g., long-term customer value)"
- Your judge rubric measures: "How well does this response optimize Y*?"
- Both use the same deliberation procedure to approximate Y* in practice
This prevents Goodhart's Law at scale. Without Y*-alignment, your prompt says "be helpful" but your rubric measures "response length." The AI optimizes for what you measure, not what you asked for. With Y*-alignment, generation and evaluation speak the same language—the language of welfare.
Recent evidence: OpenAI’s deliberative alignment
In December 2024, OpenAI released deliberative alignment (Guan et al., 2025), an approach that directly teaches o-series models the text of safety specifications and trains them to reason over these specs at inference time using chain-of-thought.
Key result: By teaching models explicit safety specifications (rather than inferring from examples), OpenAI achieved a Pareto improvement on both under- and overrefusals—simultaneously better at avoiding harmful outputs while being more permissive with benign prompts. The approach also showed strong generalization to out-of-distribution safety scenarios.
Connection to Y*-alignment: Deliberative alignment is Y*-alignment for safety. OpenAI:
- Defines the target explicitly (safety specifications = their Y*)
- Teaches models to reason over the target (CoT over specs = structured deliberation)
- Achieves better calibration (Pareto improvement = no gaming the metric)
- Gets strong transportability (OOD generalization = specification-awareness travels)
This validates the core insight: explicit welfare targets + structured deliberation beats inferring behavior from examples. What OpenAI does for generation, CJE + Y*-alignment extends to evaluation: align both your prompt and your rubric to the same explicit welfare construct.
But here’s the catch: You can’t directly access Y* in practice. Perfect deliberation is expensive, slow, and often impossible. So instead, you operationalize Y* at a practical level that approximates it well enough for your use case.
That’s where the Deliberation Ladder comes in.
How to operationalize Y*: The Deliberation Ladder
You can’t directly measure Y* (perfect deliberation is infinite-cost). Instead, you climb the ladder from cheap signals toward Y*.
The Deliberation Ladder
Idealized Deliberation Oracle (IDO): the Y* target
What you'd decide with unlimited time, perfect information, and reflection.
Examples: Customer lifetime value, long-term health and happiness, future life satisfaction
Oracle / High-Rung Outcome: the operational label Y
Expensive but practical labels that better approximate Y*.
Examples: Expert audits, expert panel ratings, task success, 30–90 day retention and other longer-run outcomes
Cheap Surrogate: the scalable signal S
Fast signals you can collect at scale.
Examples: LLM-judge scores, clicks, watch time, quick human labels (e.g., MTurk ratings), BLEU
How the ladder works in practice
- S (Cheap Surrogate): Fast signals you can collect at scale—LLM-judge scores, response length, immediate ratings. Abundant but noisy.
- Y (Oracle / High-Rung Outcome): Expensive but practical labels that better approximate Y*—expert audits, deliberative panel ratings, task success after reflection. Use these to calibrate S.
- Y* (Idealized Deliberation Oracle): The theoretical target—what you’d conclude with complete information, reflective consistency, and impartial aggregation. You never measure this directly, but it’s what both your prompt and your judge target.
The Y*-alignment strategy:
- Target Y*: Tell your generation prompt and your judge rubric to aim for the same idealized welfare outcome
- Use SDP: Define a Standard Deliberation Protocol (SDP v1.0) with concrete steps (gather evidence → assess impacts → counter-position → answer) that both generation and judges follow
- Calibrate S using oracle labels: Collect operational labels Y (judges following SDP) on a small slice, then learn the mapping φ: S → Y* (a minimal calibration sketch follows this list)
- Scale with S: Use calibrated cheap signals at scale, knowing they target the same Y* as your prompt via SDP
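The calibration step is the main piece that needs code. Here is a minimal sketch using scikit-learn's isotonic regression; the scores are toy placeholders, and any monotone calibrator fit on your oracle slice would work:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Oracle slice: items that have both a cheap surrogate score S (e.g., an LLM-judge
# score) and an operational label Y from judges following SDP v1.0. Toy values here.
S_oracle = np.array([0.92, 0.40, 0.71, 0.65, 0.88, 0.30])
Y_oracle = np.array([0.80, 0.35, 0.60, 0.70, 0.75, 0.20])

# Learn the monotone map phi: S -> Y, the operational proxy for Y*.
phi = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
phi.fit(S_oracle, Y_oracle)

# At scale: turn cheap surrogate scores into calibrated welfare estimates
# that live on the same 0-1 Y* scale your prompt targets.
S_production = np.array([0.55, 0.95, 0.10])
Y_hat = phi.predict(S_production)
print(Y_hat)
```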
Common Mistake: Vague Target
Don’t just write: “Provide the best response” or “Be helpful.”
Why it fails: The AI doesn’t know what “best” means. Judges interpret it differently. You get inconsistent optimization and evaluation.
Instead: Define Y* explicitly (e.g., “maximize long-term customer value” or “minimize actual risk after full deliberation”) and give a concrete deliberation procedure to approximate it. Both prompt and rubric use the same definition and procedure.
This is the heart of Y*-alignment: Both your generation prompt and your judge rubric target the same Y* and use a consistent deliberation procedure to approximate it.
What is a Standard Deliberation Protocol (SDP)?
A Standard Deliberation Protocol (SDP) is a fixed, versioned procedure that operationalizes how to approximate Y* in practice.
Think of it as a recipe for deliberation. Instead of saying "be helpful" (vague and unactionable), an SDP gives concrete steps:
- Gather evidence: Collect all relevant facts, constraints, and context
- Assess impacts: Consider stakeholder perspectives and consequences
- Counter-position: Identify and address alternative viewpoints
- Answer: Make a judgment on the 0-1 Y* scale
Why version it? When you update your SDP (e.g., add a new step, change the rubric), you bump the version (v1.0 → v1.1 → v2.0). This creates an audit trail and makes evaluations reproducible. You always know which deliberation procedure produced which labels.
Where to use SDP: Both your generation prompt and your judge rubric reference the same SDP. This ensures both are targeting Y* in the same way, creating alignment between optimization and evaluation.
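One practical way to keep prompt and rubric on the same procedure is to store the SDP as a single versioned artifact and render both from it. A minimal sketch, with field names that are assumptions rather than a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SDP:
    """A fixed, versioned deliberation procedure shared by generation and evaluation."""
    version: str
    date: str
    y_star_definition: str
    steps: tuple

SDP_V1 = SDP(
    version="v1.0",
    date="2025-11-11",
    y_star_definition="Long-term customer value after full deliberation",
    steps=(
        "Gather evidence: collect all relevant facts, constraints, and context",
        "Assess impacts: consider stakeholder perspectives and consequences",
        "Counter-position: identify and address alternative viewpoints",
        "Answer: make a judgment on the 0-1 Y* scale",
    ),
)

# Log the version with every generation and every judge label so you always
# know which deliberation procedure produced which scores.
print(f"SDP {SDP_V1.version} ({SDP_V1.date}), {len(SDP_V1.steps)} steps")
```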
Now let’s make it concrete with copy-paste templates.
Copy-Paste Templates
These templates operationalize Y*-alignment using the Standard Deliberation Protocol (SDP v1.0). Copy-paste into your prompts and rubrics, then adapt to your domain. The version string (v1.0, 2025-11-11) should be logged with every evaluation for audit trails.
Generation Prompt Template
Use this in your AI system prompt to align generation to Y*.
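A minimal illustrative version; treat the exact wording as a sketch and adapt the bracketed Y* definition to your domain:

```text
You optimize for Y*: the outcome an idealized deliberator would endorse given
complete information, reflective consistency, and impartial consideration of all
stakeholders. For this system, Y* = [e.g., long-term customer value, not ticket
closure speed].

Follow the Standard Deliberation Protocol (SDP v1.0, 2025-11-11) before answering:
1. Gather evidence: list the relevant facts, constraints, and context.
2. Assess impacts: consider consequences for each stakeholder, short- and long-term.
3. Counter-position: state the strongest alternative response and why you set it aside.
4. Answer: give the response that best advances Y*.

Then report:
- Gap-to-ideal: what information or deliberation is missing relative to the ideal.
- Confidence: low / medium / high that your answer matches what full deliberation
  would conclude.
```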
What this does:
- Defines Y* as the idealized welfare target
- Specifies SDP v1.0 (4 concrete steps with versioning)
- Requires explicit gap reporting (what’s missing to reach ideal)
- Outputs confidence level for transparency
Judge Rubric Template
Use this rubric for human or LLM judges evaluating responses.
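An illustrative sketch that mirrors the generation template above; adapt the bracketed Y* definition to match your prompt exactly:

```text
You are evaluating a response against Y*: the outcome an idealized deliberator
would endorse given complete information, reflective consistency, and impartial
consideration of all stakeholders. For this system, Y* = [same definition as the
generation prompt].

Follow the Standard Deliberation Protocol (SDP v1.0, 2025-11-11) before scoring:
1. Gather evidence: what facts, constraints, and context matter for this task?
2. Assess impacts: what are the likely consequences for each stakeholder?
3. Counter-position: what would a materially better response have done differently?
4. Answer: judge how well this response advances Y*.

Then output:
- Y score: a number in [0, 1], where 1 means the response matches what idealized
  deliberation would endorse.
- Gap-to-ideal: what is uncertain or missing that prevents a confident judgment.
```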
What this does:
- Uses the same Y* definition as the generation prompt
- Specifies the same SDP v1.0 (4 concrete steps with versioning)
- Requires explicit gap-to-ideal reporting (what’s uncertain)
- Outputs Y score on [0,1] scale (operational welfare label)
Two-Week Rollout Plan
Here’s a practical plan to implement Y*-alignment in your system:
Week 1: Baseline & Pilot Setup
Day 1-2: Document your current prompt and rubric. Identify misalignments (e.g., prompt says “helpful,” rubric measures “brevity”).
Day 3-4: Adapt the templates above to your domain. Replace generic examples with your actual use case.
Day 5: Run baseline evaluation: Have 2-3 judges score 30 responses using your current rubric. Calculate inter-rater reliability (ICC or Krippendorff’s α); a minimal ICC sketch appears at the end of this week’s plan.
Day 6-7: Implement Y*-aligned prompt and rubric. Deploy side-by-side: 50% traffic to old prompt, 50% to new.
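For the Day 5 reliability check, here is a minimal ICC(2,1) computation from the Shrout–Fleiss formulas, assuming a complete responses-by-judges score matrix with no missing values (the demo data is synthetic); Krippendorff's α is an equally good choice if you prefer an off-the-shelf package:

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_responses, n_judges) matrix with no missing values.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-response means
    col_means = ratings.mean(axis=0)   # per-judge means

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-response variance
    ms_cols = ss_cols / (k - 1)                # between-judge variance
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual variance

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Demo: 30 responses scored by 3 judges on a 0-1 rubric (synthetic data).
rng = np.random.default_rng(0)
true_quality = rng.normal(0.6, 0.15, size=(30, 1))
scores = np.clip(true_quality + rng.normal(0, 0.05, size=(30, 3)), 0, 1)
print(f"ICC(2,1) = {icc_2_1(scores):.2f}  (target: > 0.7)")
```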
Week 2: Testing & Diagnostics
Day 8-10: Collect 100 responses from each prompt. Have the same judge pool evaluate all 200 using the Y*-aligned rubric.
Day 11: Run diagnostics (see checklist below). Check for:
- Judge-pool invariance (do different judges converge to similar Y* scores?)
- Gap-to-ideal reporting consistency (are judges flagging similar uncertainties?)
- Improvement in welfare scores (Y*-aligned vs baseline)
Day 12-13: Analyze results. If Y*-aligned prompt shows 10%+ improvement in mean Y* score and judges show high agreement, proceed to full rollout. If not, iterate on prompt/rubric.
Day 14: Document findings. Write a one-page summary: “What we changed, why, and what improved.”
Success Criteria:
- Judge agreement: ICC > 0.7 (judges using Y*-aligned rubric converge)
- Welfare improvement: Mean Y* score increases by 10%+ vs baseline
- Gap reporting: Judges consistently identify similar gaps-to-ideal (indicates shared understanding of Y*)
- No gaming: AI doesn’t exploit rubric loopholes (check manually on 20 edge cases)
Practical Diagnostics Checklist
Run these checks after implementing Y*-alignment to verify it’s working:
Judge-Pool Invariance
Test: Have two different judge pools (e.g., contractors vs internal experts) score the same 50 responses using the Y*-aligned rubric.
Pass criterion: Mean Y* scores differ by <5%, and confidence intervals overlap. This indicates both pools are estimating the same Y* construct.
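A minimal sketch of this check using a percentile bootstrap; the two score arrays are synthetic placeholders for the two judge pools:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_with_ci(scores, n_boot=10_000, alpha=0.05):
    """Mean Y score for one judge pool with a percentile-bootstrap confidence interval."""
    scores = np.asarray(scores)
    boots = rng.choice(scores, size=(n_boot, len(scores)), replace=True).mean(axis=1)
    return scores.mean(), np.quantile(boots, alpha / 2), np.quantile(boots, 1 - alpha / 2)

# Same 50 responses, scored by two different judge pools with the Y*-aligned rubric.
pool_a = rng.normal(0.68, 0.10, 50).clip(0, 1)  # e.g., contractors
pool_b = rng.normal(0.70, 0.10, 50).clip(0, 1)  # e.g., internal experts

mean_a, lo_a, hi_a = mean_with_ci(pool_a)
mean_b, lo_b, hi_b = mean_with_ci(pool_b)

rel_diff = abs(mean_a - mean_b) / max(mean_a, mean_b)
ci_overlap = (lo_a <= hi_b) and (lo_b <= hi_a)
print(f"Pool A mean: {mean_a:.3f} [{lo_a:.3f}, {hi_a:.3f}]")
print(f"Pool B mean: {mean_b:.3f} [{lo_b:.3f}, {hi_b:.3f}]")
print(f"Relative difference: {rel_diff:.1%}, CIs overlap: {ci_overlap}")
```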
Gap-to-Ideal Consistency
Test: Review gap-to-ideal reports from 30 evaluations. Cluster by theme (e.g., “missing long-term impact data,” “unclear stakeholder preferences”).
Pass criterion: 70%+ of gaps fall into 3-5 recurring themes. This indicates judges share a common understanding of what’s missing to reach ideal deliberation.
No Gaming / Goodhart Check
Test: Manually review 20 responses with high Y* scores (0.8+). Check if they actually maximize welfare or just game the rubric (e.g., verbose deliberation that doesn’t add value).
Pass criterion: 90%+ of high-scoring responses genuinely reflect strong deliberation quality, not surface-level mimicry.
Welfare Improvement vs Baseline
Test: Compare mean Y* scores from Y*-aligned prompt vs baseline prompt (using the Y*-aligned rubric for both).
Pass criterion: Y*-aligned prompt shows 10%+ improvement in mean Y* score, with p < 0.05 (t-test or bootstrap).
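A minimal sketch of this comparison with Welch's t-test from SciPy; the score arrays are synthetic placeholders, and a bootstrap test works just as well:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Y* scores from the same judge pool and rubric, for responses from each prompt.
baseline = rng.normal(0.60, 0.12, 100).clip(0, 1)  # old prompt
aligned = rng.normal(0.68, 0.12, 100).clip(0, 1)   # Y*-aligned prompt

lift = aligned.mean() / baseline.mean() - 1
t_stat, p_value = stats.ttest_ind(aligned, baseline, equal_var=False)  # Welch's t-test

print(f"Baseline mean: {baseline.mean():.3f}, aligned mean: {aligned.mean():.3f}")
print(f"Lift: {lift:+.1%}, p = {p_value:.4f}")
print("Pass" if lift >= 0.10 and p_value < 0.05 else "Iterate on prompt or rubric")
```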
Calibration Quality
Test: On a held-out oracle set (responses with expert-consensus Y* labels), fit calibrator φ: S → Y* and check calibration error (mean squared error of φ(S) vs Y*).
Pass criterion: MSE < 0.05 and residuals show no systematic bias (plot residuals vs predicted Y*).
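A minimal sketch of this check with an isotonic calibrator and a held-out split; the data is a synthetic stand-in for your oracle set:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

# Oracle set: surrogate scores S paired with expert-consensus Y* labels (toy data).
S = rng.uniform(0, 1, 200)
Y_star = np.clip(0.2 + 0.7 * S + rng.normal(0, 0.08, 200), 0, 1)

S_fit, S_holdout = S[:150], S[150:]
Y_fit, Y_holdout = Y_star[:150], Y_star[150:]

phi = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(S_fit, Y_fit)
pred = phi.predict(S_holdout)

mse = mean_squared_error(Y_holdout, pred)
residuals = Y_holdout - pred
print(f"Held-out MSE: {mse:.4f}  (target: < 0.05)")
print(f"Mean residual: {residuals.mean():+.4f}  (should be near 0; plot vs. predictions)")
```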
If any diagnostic fails, iterate on your prompt or rubric. Common fixes: add more specific deliberation steps, clarify Y* definition for your domain, or provide example evaluations with gap-to-ideal reports.
Citation
If you use this work, please cite:
BibTeX
@misc{landesberg2025ystar,
author = {Landesberg, Eddie},
title = {Your AI Is Optimizing the Wrong Thing: Y*-Aligned Systems},
year = {2025},
month = {November},
url = {https://cimolabs.com/blog/y-star-aligned-systems},
note = {CIMO Labs}
}
Plain Text
Landesberg, E. (2025). Your AI Is Optimizing the Wrong Thing: Y*-Aligned Systems. CIMO Labs. https://cimolabs.com/blog/y-star-aligned-systems
FAQ: Common Questions
Q: Isn’t Y* just “write good prompts”?
A: No. Y*-alignment is specifically about aligning generation and evaluation to the same welfare target. Many systems have good prompts but misaligned rubrics, leading to optimization failures. Y*-alignment ensures both use the same deliberative standard.
Q: Can I use Y*-alignment with model-based judges (LLMs)?
A: Yes. Use the same rubric template for LLM judges. You’ll still need an oracle slice (expert labels) to calibrate S → Y*, but once calibrated, LLM judges can scale cheaply while preserving Y*-alignment.
Q: What if my domain doesn’t have a clear “welfare” concept?
A: Replace “welfare” with your target construct. For code generation, Y* might be “correctness + maintainability under expert review.” For medical advice, Y* might be “patient safety + efficacy under specialist deliberation.” The key is: define one target, use it in both generation and evaluation.
Q: How much does this cost?
A: Initial setup: ~3-5 days of prompt engineering + 100-200 oracle labels ($500-$2000 depending on judge rates). Ongoing: calibration updates every quarter. SDP adds ~15-30% latency vs immediate gut reactions, but you can mitigate with caching and batch processing.
Q: What if judges still disagree after calibration?
A: Two causes: (1) Gap-to-ideal is large: Judges are uncertain about the same missing information. Solution: Flag as “high-uncertainty” and escalate to expert review. (2) Rubric is underspecified: Judges interpret Y* differently. Solution: Add domain-specific examples and clarify deliberation steps.
Q: Can I use Y*-alignment for reinforcement learning from human feedback (RLHF)?
A: Yes. Use the Y*-aligned rubric to generate preference labels for RLHF. This ensures your reward model optimizes Y* instead of surface correlates. Major improvement over standard RLHF, which often learns to game underspecified preferences.
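A minimal sketch of turning calibrated Y scores into preference pairs for reward-model training; the data structures and margin are illustrative assumptions:

```python
# Two candidate responses to the same prompt, each with a calibrated Y score
# from the Y*-aligned rubric (illustrative values).
candidates = [
    {"prompt": "How do I get a refund?", "response": "Response A ...", "y": 0.82},
    {"prompt": "How do I get a refund?", "response": "Response B ...", "y": 0.55},
]

MARGIN = 0.10  # skip pairs whose welfare gap is too small to be a reliable signal

def to_preference_pair(a: dict, b: dict, margin: float = MARGIN):
    """Return a (chosen, rejected) record if the Y gap exceeds the margin, else None."""
    if abs(a["y"] - b["y"]) < margin:
        return None  # too close to call; don't teach the reward model noise
    chosen, rejected = (a, b) if a["y"] > b["y"] else (b, a)
    return {"prompt": chosen["prompt"],
            "chosen": chosen["response"],
            "rejected": rejected["response"]}

print(to_preference_pair(*candidates))
```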
Q: How do I handle tasks where Y* changes over time (e.g., news, fashion)?
A: Y* is context-dependent, so include temporal context in your SDP. Example: “Given current information as of [date], estimate Y using SDP v1.0.” Recalibrate quarterly to adapt to shifting norms, and bump the SDP version if the procedure changes.
Start aligning today
Y*-alignment isn’t a research project—it’s a two-week implementation with immediate payoffs. You’ll catch misalignments before they become memes, make optimization actually work, and stop gaming the metric.
Start with the templates above. Run a pilot. Check diagnostics. Ship aligned systems.
Ready to dive deeper?
Explore the formal framework, implementation examples, or get help with your use case.
We welcome your feedback
This framework is actively evolving. We invite and appreciate constructive criticism from practitioners and researchers.
If you spot errors, have suggestions for improvement, or want to discuss your implementation, please let us know or email us at eddie@cimolabs.com. Your feedback helps us build better tools for the community.
