For any engineering team shipping a product powered by a Large Language Model (LLM), hallucination is not a research curiosity. It’s a production reliability issue, a major source of user distrust, and often a deployment blocker. A model that confidently makes things up can be worse than one that admits it doesn’t know the answer.

The usual response involves prompt engineering or choosing a better model. But what happens when that’s not enough? You have to build guardrails.

The Constraint

The team’s approach was shaped by its constraints. This is what makes it a useful real-world example.

Timeline: Four weeks. This forced them to prioritize a simple, modular architecture over a complex, monolithic one.

Success Metric: Reduce factual hallucinations by at least 15% on the SimpleQA and TruthfulQA benchmarks. A clear, measurable goal.

Budget: The average token cost per query should not exceed 2-3 times that of a simple, single-shot baseline. This is the critical business constraint that prevents a solution from being too expensive to run in production.

These constraints meant every component had to justify its existence and its cost.
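To make the success metric and the budget concrete, here is a minimal sketch of how they could be checked in code. The class and function names are hypothetical, the 3.0 multiplier takes the upper end of the "2-3 times" range, and the 15% target is read as a relative reduction, which is an assumption because the text does not say whether it is relative or absolute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    # "at least 15%" from the text; treated as a relative reduction (assumption).
    min_hallucination_reduction: float = 0.15
    # Upper end of the "2-3 times" budget from the text.
    max_cost_multiplier: float = 3.0


def within_budget(avg_tokens: float, baseline_tokens: float,
                  c: Constraints = Constraints()) -> bool:
    """True if the pipeline's average cost stays within the allowed multiple."""
    return avg_tokens <= c.max_cost_multiplier * baseline_tokens


def meets_target(baseline_rate: float, new_rate: float,
                 c: Constraints = Constraints()) -> bool:
    """True if the hallucination rate dropped by the required relative amount."""
    if baseline_rate <= 0:
        return False  # nothing to reduce, so the target cannot be demonstrated
    reduction = (baseline_rate - new_rate) / baseline_rate
    return reduction >= c.min_hallucination_reduction
```

A check like this can run after every evaluation pass, so a component that breaks the budget is noticed as soon as it is added.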

A Four-Stage Funnel with Early Exits

The team’s solution is a multi-stage pipeline designed to escalate interventions. Think of it like a funnel: it applies the cheapest checks first and only brings in the expensive ones when absolutely necessary.

The most important architectural principle here is early termination. If any stage produces a high-confidence answer, the pipeline can exit immediately, saving the cost and latency of running the remaining stages.

This is not just a chain of prompts; it’s a system designed to manage a budget. Let’s look at each stage.
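The early-exit idea can be sketched as a loop over stages that stops as soon as one of them reports enough confidence. This is an illustrative sketch, not the team's implementation: the `Stage` signature, the `run_funnel` name, and the 0.9 threshold are all assumptions.

```python
from typing import Callable, List, Tuple

# Hypothetical stage interface: takes the current text and returns
# (new_text, confidence in [0, 1], tokens_used).
Stage = Callable[[str], Tuple[str, float, int]]


def run_funnel(query: str, stages: List[Stage],
               threshold: float = 0.9) -> Tuple[str, int, int]:
    """Run stages in order, cheapest first, and stop at the first confident result.

    Returns (final_text, total_tokens_spent, number_of_stages_run).
    """
    text, total_tokens, stages_run = query, 0, 0
    for stage in stages:
        text, confidence, tokens = stage(text)
        total_tokens += tokens
        stages_run += 1
        if confidence >= threshold:
            break  # early exit: later, more expensive stages are skipped
    return text, total_tokens, stages_run
```

A stage that only rewrites the prompt, such as the first one below, would simply report a confidence of 0.0 so that it never triggers the exit on its own.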

Stage 1: Automatic Prompt Optimization (APO)

The cheapest way to prevent a bad answer is to start with a good question. This first stage doesn’t try to answer the user’s query at all. Instead, it uses an LLM to rewrite the original prompt into a version that is less ambiguous and more likely to elicit a factual response.

How it works: It takes a vague or poorly phrased question and reformulates it based on a set of principles designed to reduce hallucinations.

Why it’s first: It sanitizes the input before the main, expensive generation model ever sees it. This is the cheapest, most effective first line of defense.
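A prompt-rewriting stage of this kind might look like the sketch below. The `LLM` callable, the `optimize_prompt` name, and the wording of the rewrite template are assumptions made for illustration; the article does not list the actual principles the team used.

```python
from typing import Callable

# Hypothetical model interface: prompt in, completion out.
LLM = Callable[[str], str]

# Illustrative rewrite instruction; the team's real principles are not given.
REWRITE_TEMPLATE = (
    "Rewrite the question below so that it is specific, unambiguous, "
    "and asks for a verifiable fact. Do not answer it. "
    "Return only the rewritten question.\n\nQuestion: {question}"
)


def optimize_prompt(question: str, llm: LLM) -> str:
    """Ask the model to rewrite the question, without answering it."""
    rewritten = llm(REWRITE_TEMPLATE.format(question=question)).strip()
    # Fall back to the original question if the model returns nothing usable.
    return rewritten or question
```

The fallback matters in practice: a rewriting stage that can silently return an empty prompt would make every later stage worse, not better.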

Stage 2: Chain-of-Verification (CoVe)

Once the prompt is optimized, the model is asked to generate an answer. But instead of trusting that answer, the pipeline forces the model to generate a plan to verify its own work. It has to show its reasoning by breaking the answer down into verifiable sub-questions and checking them.

How it works: The model first drafts an answer, then generates a series of verification questions, answers those questions, and finally produces a corrected, final answer.

Why it’s useful: It’s a structured way to force the model to self-audit, catching logical errors or factual inconsistencies before the process goes any further.
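The four steps described above (draft, plan, verify, revise) can be sketched as a single function. The `LLM` callable and the prompt wording are hypothetical, and this is a simplified reading of the stage rather than the team's code.

```python
from typing import Callable

# Hypothetical model interface: prompt in, completion out.
LLM = Callable[[str], str]


def chain_of_verification(question: str, llm: LLM) -> str:
    """Draft an answer, verify it with sub-questions, then revise it."""
    # 1. Draft an initial answer.
    draft = llm(f"Answer concisely: {question}")

    # 2. Plan: ask for verification questions, one per line.
    plan = llm(
        "List verification questions, one per line, that would check "
        f"this answer.\nQuestion: {question}\nAnswer: {draft}"
    )
    checks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 3. Answer each verification question independently of the draft.
    findings = "\n".join(f"Q: {q}\nA: {llm(q)}" for q in checks)

    # 4. Produce the corrected final answer from the draft plus the findings.
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification results:\n{findings}\n"
        "Write the corrected final answer."
    )
```

Note the cost profile: one call for the draft, one for the plan, one per verification question, and one for the revision, so the number of verification questions directly drives the token budget.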

Stage 3: Intrinsic Self-Correction

This is the last low-cost check before escalation. The model is asked to take its own candidate answer and review it one more time, but at a lower temperature (making it less “creative” and more deterministic).

How it works: It’s like telling someone, “Are you sure? Read that back to me one more time, carefully.” This simple step reduces variance and often catches small factual drifts.

Why it’s a good trade-off: It’s a very cheap way to get a final layer of confidence before escalating to the most expensive stage.
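A low-temperature review pass could be sketched as follows. The `LLM` signature with an explicit temperature argument, the default of 0.0, and the prompt wording are assumptions for illustration only.

```python
from typing import Callable

# Hypothetical model interface: (prompt, temperature) -> completion.
LLM = Callable[[str, float], str]


def self_correct(question: str, candidate: str, llm: LLM,
                 review_temperature: float = 0.0) -> str:
    """Ask the same model to re-read its answer at a low temperature."""
    prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {candidate}\n"
        "Review the proposed answer carefully for factual errors. "
        "Return the corrected answer, or repeat it unchanged if it is correct."
    )
    revised = llm(prompt, review_temperature).strip()
    # Keep the candidate if the review returns nothing usable.
    return revised or candidate
```

This stage costs a single extra call, which is why it sits before the external judge rather than after it.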

Stage 4: The External LLM Judge

If the pipeline still isn’t confident, it escalates. A separate, independent LLM is brought in to act as a judge. Its only job is to provide a final pass/fail judgment on the factuality of the answer.

How it works: The judge receives the original question and the final candidate answer. It returns a binary verdict: factually correct or not.

Why it’s last: This is the most computationally expensive step, so it’s used as a last resort. It’s the “human supervisor” of the pipeline, but automated.
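A pass/fail judge of this shape might be wrapped as in the sketch below. The `LLM` callable, the PASS/FAIL output convention, and the decision to treat any unparseable reply as a failure are assumptions, not details from the team's implementation.

```python
from typing import Callable

# Hypothetical interface to a separate judge model: prompt in, completion out.
LLM = Callable[[str], str]


def judge_factuality(question: str, answer: str, judge: LLM) -> bool:
    """Return True only if the judge model explicitly passes the answer."""
    prompt = (
        "You are a strict fact checker. Reply with exactly PASS if the "
        "answer is factually correct, otherwise reply FAIL.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    verdict = judge(prompt).strip().upper()
    # Fail closed: anything other than an explicit PASS counts as a failure.
    return verdict.startswith("PASS")
```

Failing closed is a deliberate choice in this sketch: when the judge's reply is ambiguous, it is safer to treat the answer as unverified than to let it through.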

Key Trade-off: Reliability vs. Cost

The fundamental trade-off of this architecture is increased reliability at the expense of a higher token budget and latency. Each stage adds another layer of defense against hallucinations but also incurs additional LLM calls.

The team’s initial smoke tests on a small sample of SimpleQA items quantified this trade-off clearly:

APO Only: The average cost was 162 tokens per query. APO was the dominant contributor, accounting for 81% of the total budget.

APO + CoVe: The average cost increased to 245 tokens per query. CoVe added a significant but smaller increment to the budget.

APO + CoVe + Self-Correct: The average cost rose to 312 tokens per query.

Full Pipeline (with Judge): The average cost was 377 tokens per query. The external judge proved to be a noticeable tail cost.
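The incremental cost of each stage follows directly from these four averages. The short script below only does that subtraction; it uses the figures quoted above and nothing else. The single-shot baseline cost is not given in the text, so the script does not attempt to check the 2-3x budget.

```python
# Average tokens per query, copied from the smoke-test figures in the text.
configs = [
    ("APO only", 162),
    ("APO + CoVe", 245),
    ("APO + CoVe + Self-Correct", 312),
    ("Full pipeline (with Judge)", 377),
]


def added_tokens(configs):
    """Return the tokens each configuration adds over the previous one."""
    deltas, previous = [], 0
    for _, total in configs:
        deltas.append(total - previous)
        previous = total
    return deltas


for (name, total), delta in zip(configs, added_tokens(configs)):
    print(f"{name}: {total} tokens (+{delta})")
```

On this small sample, adding CoVe costs 83 tokens, adding self-correction costs 67, and adding the judge costs 65.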

Each stage also passes its accumulated context to the next, so the text “snowballs” as the pipeline progresses. ScaleDown’s pruning addresses this by wrapping the context passed between stages and analyzing the entire snowball. For example, before passing the context to the final Judge, it can identify and remove redundant information from the original prompt that is already summarized in the CoVe plan, or compress the verification steps themselves, preserving only the semantically essential information needed to make a judgment.

The result is that the prompt for the (expensive) Judge model can be significantly smaller, sometimes by 40-60% or more, while retaining 100% of the factual information. This directly attacks the compounding token cost and the associated processing latency, turning a geometrically scaling cost problem into a much more manageable, linear one.
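To see why accumulated context drives cost, consider a toy model in which every stage receives everything produced so far and adds a fixed amount of new text. This is an illustration under stated assumptions, not a measurement and not ScaleDown's algorithm: the function name, the fixed per-stage growth, and the simple `cap` standing in for pruning are all hypothetical.

```python
from typing import Optional


def cumulative_input_tokens(n_stages: int, base: int, added_per_stage: int,
                            cap: Optional[int] = None) -> int:
    """Total input tokens sent across all stages in a toy snowball model.

    Each stage is sent the current context (pruned to `cap` tokens if a cap
    is given) and then appends `added_per_stage` tokens of its own output.
    """
    total, context = 0, base
    for _ in range(n_stages):
        sent = context if cap is None else min(context, cap)
        total += sent
        context = sent + added_per_stage
    return total


# Four stages, 100-token prompt, each stage adding 100 tokens.
print(cumulative_input_tokens(4, 100, 100))           # unpruned: 100+200+300+400
print(cumulative_input_tokens(4, 100, 100, cap=150))  # pruned:   100+150+150+150
```

In the unpruned case the per-stage input keeps growing, so the total grows faster than linearly with the number of stages. With the context held to a bound, each stage costs roughly the same and the total grows linearly.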

Pruning the context between stages is a tactical choice that helps make this kind of robust, multi-stage architecture economically viable in a production environment. You can view the full implementation, including the composable stages and evaluation framework, in their public repository.