© 2026 NervNow™. All rights reserved.

Will AI Hallucinations Ever Go Away?
Vendors say hallucination rates are falling. Researchers say full elimination is mathematically impossible. Both are right, and the gap between those two statements is where enterprise AI strategies are currently breaking down.

Two of the world’s biggest consulting firms learned this the hard way. The research says the rest of the industry is next.
The firms that got caught
Last month, KPMG pulled one of its flagship reports from circulation. The report, titled “Redefining Excellence in the Age of Agentic AI,” had been distributed by the firm’s consulting groups across multiple countries, with local marketing contacts attached. Research outfit GPTZero then conducted a forensic review of its 45 citations. Of those, only five accurately pointed to real sources.
At least 16 were hallucinations. Another 12 were too ambiguous or flawed to determine whether a source existed at all. The errors were not random. Most citations had at least one hallucinated component, typically the title. Many also had fabricated dates or authors.
The report claimed Emirates airline had adopted a mobile chatbot named Sara that could converse directly with passengers and change their flights. Sara is a robot assistant introduced by Emirates in 2023. It is not a chatbot. It cannot alter flight bookings.
The report also described purported agentic AI deployments at UBS, the NHS, Swiss Federal Railways, and Transport for London. UBS said the assertions about its operations were factually incorrect. The other organizations told the Financial Times the descriptions were either inaccurate or misleading.
KPMG removed the report and opened an internal review. Its statement: “We expect all our people to follow our guidelines on the responsible use of AI, including human oversight to validate content and verify independent sources.”
The bitter detail: this was a report about the promise of AI, written with AI, undone by AI.
It was not the first. Deloitte Australia had already agreed to partially refund the Australian government for a $290,000 report containing AI-generated errors. The document, a 237-page independent assurance review of a welfare compliance system, included references to multiple academic papers that did not exist, citations attributed to a real professor at the University of Sydney for work she had never published, and a fabricated quote attributed to an actual federal court judgment. The judge’s name was even misspelled.
Sydney University researcher Chris Rudge spotted the problem almost immediately after publication. “I instantaneously knew it was either hallucinated by AI or the world’s best kept secret because I’d never heard of the book and it sounded preposterous,” he told the Associated Press.
Deloitte reviewed the report, confirmed some footnotes and references were incorrect, and agreed to refund the government’s final payment installment. A revised version was published with a disclosure noting that Azure OpenAI GPT-4o had been used as part of the technical workstream. Of the 141 sources originally cited, 14 were removed in the revised version.
Deloitte said the substance of the review was retained and the recommendations unchanged. Australian Labor Senator Deborah O’Neill responded: “Deloitte has a human intelligence problem.”
EY withdrew a loyalty rewards program report the same month after GPTZero identified fake footnotes and errors following the same pattern. Three of the four largest professional services firms in the world. Same failure mode. Same cause. Same tool.
“Publications by firms like these poison the well of information, because their work is treated as credible and referenced by other publications, creating chains of second-hand hallucination that are almost impossible to trace,” said Edward Tian, GPTZero CEO.
Why the architecture makes this inevitable
An AI hallucination is not a mistake in the way a human makes a mistake. Understanding why requires a brief look at what these systems actually do, because the mechanism matters for everything that follows.
Every large language model, regardless of provider, operates on a single core principle: given the text that has come before, predict the most statistically probable next piece of text. The model does not retrieve facts from a database. It generates, token by token, the continuation that its training across billions of documents tells it is most likely to follow the input it has been given.
This process is extraordinarily good at producing fluent, contextually appropriate, stylistically coherent text. It is also, by design, indifferent to truth in any foundational sense. If the context suggests that a citation belongs at this point in a report, the model produces a citation. Whether that citation corresponds to something real is a separate question the architecture does not automatically ask.
This is why the Deloitte report looked exactly like a properly researched document. The paper titles were plausible. The author names were formatted correctly. The dates fell within credible ranges.
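The mechanism can be seen in miniature. The sketch below is not how a production LLM is built; it is a toy bigram model, assuming a three-sentence corpus of invented names and titles, that picks each next word in proportion to how often it followed the previous one. Every function name and every "citation" in it is made up for illustration.

```python
import random
from collections import defaultdict

# Tiny made-up corpus; every name and title here is invented for illustration.
CORPUS = [
    "smith 2019 wrote the paper on welfare compliance",
    "jones 2021 wrote the paper on automated decisions",
    "smith 2021 wrote the report on automated compliance",
]

def train_bigrams(sentences):
    """Count which word follows which across the corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_words, rng):
    """Sample a continuation word by word, weighted by observed counts."""
    words = [start]
    # Stop at the length limit or when the last word has no known follower.
    while len(words) < max_words and counts[words[-1]]:
        followers = counts[words[-1]]
        choices = list(followers)
        weights = [followers[w] for w in choices]
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)

BIGRAMS = train_bigrams(CORPUS)
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = {generate(BIGRAMS, "jones", 8, rng) for _ in range(200)}

# Sentences the model produced that appear nowhere in its training text.
fabricated = sorted(s for s in samples if s not in CORPUS)
for sentence in fabricated:
    print(sentence)
```

Each printed sentence is built entirely from word pairs the model has seen, so it reads like the corpus, yet it attributes to "jones" work the corpus never attributes to that author. Nothing in the sampling loop checks whether the finished sentence is true.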
In 2024, Xu et al. formalized this mathematically: eliminating hallucination in large language models is not a matter of more compute, better training data, or more careful fine-tuning. It is mathematically impossible given the fundamental architecture. Any system that generates text by predicting probable sequences from learned statistical distributions will, by mathematical necessity, sometimes produce outputs not grounded in fact.
Karpowicz, working at Samsung AI Center Warsaw in 2025, approached the same question from three independent mathematical frameworks and reached the same conclusion each time. No LLM inference mechanism can simultaneously satisfy all four of these properties:
- Truthful response generation — always producing factually correct output
- Knowledge conservation — retaining everything learned during training
- Relevant knowledge revelation — surfacing applicable knowledge per query
- Constrained optimality — operating within token prediction constraints
Source: Karpowicz, Samsung AI Center Warsaw, 2025, cited across multiple independent benchmark reviews. NervNow recommends verifying against the original paper before citing in published work. Secondary sources: axis-intelligence.com, suprmind.ai.
The implication is not that AI is useless or unreliable in all contexts. It is that the architecture has a structural ceiling on truthfulness that does not move regardless of how good the model becomes at everything else.
The reasoning paradox
The finding that complicates this further, and that most enterprise AI strategies have not yet absorbed, is that the models being marketed most aggressively as more capable are in several measurable ways worse on this specific dimension.
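In 2025, benchmarking against the PersonQA dataset, which tests factual recall about specific individuals, produced a result that cut against the prevailing assumption that more reasoning capability means fewer errors. The reported hallucination rates for OpenAI's models were 16 percent for o1, 33 percent for o3, and 48 percent for o4-mini.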
Source: PersonQA benchmark results, 2025. Via aboutchromebooks.com and axis-intelligence.com. NervNow recommends verifying against OpenAI’s published evaluations before citing.
A reasoning model builds longer internal chains of inference before producing an answer. That process is genuinely useful for complex, multi-step problems. But those longer chains also create more opportunities for the model to generate plausible-sounding intermediate steps that are not grounded in fact, and to arrive at wrong answers with the full structural confidence of a well-reasoned conclusion.
OpenAI’s own research paper on the subject, published in September 2025, described the incentive structure clearly: next-token prediction training objectives and common leaderboard benchmarks penalize “I don’t know” responses, which means models are implicitly trained to bluff rather than abstain. The result is models that their training disposes toward confident wrong answers rather than calibrated uncertainty.
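The incentive is easy to work through with arithmetic. The sketch below assumes two simplified grading schemes that are illustrative only, not the ones used in the OpenAI paper or on any specific leaderboard: a binary scheme where a wrong answer and an abstention both score zero, and a penalized scheme where a wrong answer costs points. The function names are hypothetical.

```python
ABSTAIN_SCORE = 0.0  # "I don't know" earns nothing under either scheme

def expected_score(p_correct, wrong_penalty=0.0):
    """Expected score for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def best_policy(p_correct, wrong_penalty=0.0):
    """Pick whichever of guessing or abstaining has the higher expected score."""
    if expected_score(p_correct, wrong_penalty) > ABSTAIN_SCORE:
        return "guess"
    return "abstain"

# Binary grading: even a 10% chance of being right makes guessing the better bet.
print(best_policy(0.10))                      # guess
# Penalized grading (wrong answer costs 1 point): the same model should abstain.
print(best_policy(0.10, wrong_penalty=1.0))   # abstain
# With the penalty, guessing only pays once p_correct exceeds penalty / (1 + penalty).
print(best_policy(0.60, wrong_penalty=1.0))   # guess
```

Under binary grading there is no probability of being right, however small, at which abstaining scores better than guessing. A model optimized against that kind of scoring has no reason to learn to abstain.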
Where the problem is worst
General-purpose hallucination rates on well-supported factual queries have fallen across leading models, and the headline numbers vendors publish reflect real progress. The problem is sharpest precisely where the stakes are highest.
On legal-specific queries, large language models hallucinate between 69 and 88 percent of the time, according to research from Stanford RegLab and the Stanford Human-Centered AI Institute. That range is not a rounding error. It reflects a structural gap between what the models were trained on, broad text corpora, and what legal analysis demands: precise, verifiable, jurisdiction-specific claims about specific statutes, cases, and precedents.
The consequences are not theoretical. Multiple lawyers have been sanctioned by courts in the U.S. and U.K. for submitting AI-generated briefs citing cases that do not exist. The citations were formatted correctly. The citations were entirely fabricated.
ECRI’s 2026 Top 10 Health Technology Hazards Report ranked AI chatbot misuse as the single greatest health technology hazard of the year, noting that over 40 million people consult AI for health information daily.
Clinical decision support is the high-stakes end of this. A model that fabricates a drug interaction, a contraindication, or a dosage range with the same confidence it would use for a well-established clinical fact produces an output that is indistinguishable from a correct one unless verified against a primary source. That verification step is the one most systems are missing.
Financial compliance analysis depends on precise citation of regulations, rulings, and precedents in exactly the same way legal analysis does. The hallucination risk profile is comparable. A compliance report that cites a regulatory requirement that does not exist, or mischaracterizes one that does, has an error that can pass undetected until an examiner or counterparty catches it.
The $67.4 billion global cost figure reported by AllAboutAI in 2025 is not dominated by dramatic public failures. Most of it accumulates in compounded errors that travel through compliance chains before anyone identifies where the fabrication entered.
The KPMG and Deloitte cases are the most public demonstrations of what happens when AI-generated content is treated as production-ready without citation-level verification. Both reports passed through full professional review chains before publication. Neither chain was designed to catch AI hallucinations.
GPTZero CEO Edward Tian identified the systemic consequence: publications by firms with this kind of institutional credibility are referenced by other publications, industry analyses, and board presentations without re-verification. A hallucinated citation in a KPMG report can become a sourced fact in a dozen subsequent documents before the original error is discovered.
What actually reduces it
The vendors are not misrepresenting the progress when they report falling hallucination rates. The improvements are real. But there is a meaningful difference between reducing frequency and eliminating the underlying cause, and the interventions that actually work make that distinction visible.
| Method | Reduction | What it addresses |
|---|---|---|
| Retrieval-augmented generation (RAG) | 75–90% | Constrains generation to retrieved, verifiable sources; anchors output to documents you control |
| Tool grounding | 65–80% | Ties specific claims to external verified outputs rather than statistical prediction |
| Self-consistency checking | ~65% | Identifies internal contradictions across multiple generations before output is surfaced |
| Prompt engineering alone | ~15% | Surface-level behavioral instruction only; does not change the underlying generative process |
The gap between those numbers tells the story. Prompt engineering, the most common intervention and the one most organizations reach for first, produces the weakest result by a wide margin: on these figures, RAG's reduction is five to six times larger, and even self-consistency checking's is more than four times larger. Telling the model to be more careful does not change the underlying generative process. The interventions that work are architectural, not behavioral.
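Self-consistency checking, the third row of the table, is the simplest of these to sketch. The version below is a minimal illustration, assuming the same question has already been put to a model several times and the answers collected as strings. The function name and the 60 percent agreement threshold are assumptions, not values taken from any benchmark.

```python
from collections import Counter

def self_consistent_answer(samples, min_agreement=0.6):
    """Return the majority answer if enough samples agree, else None (abstain)."""
    if not samples:
        return None
    # Light normalization so "2019" and " 2019 " count as the same answer.
    normalized = [s.strip().lower() for s in samples]
    answer, votes = Counter(normalized).most_common(1)[0]
    if votes / len(normalized) >= min_agreement:
        return answer
    return None

# Hypothetical outputs from five runs of the same factual question.
print(self_consistent_answer(["2019", "2019", " 2019", "2021", "2019"]))  # 2019
# Four runs, four different answers: nothing is surfaced.
print(self_consistent_answer(["2019", "2021", "2017", "2020"]))           # None
```

The check catches answers the model is unstable on. It does nothing about an error the model repeats consistently, which is one reason the reduction is partial.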
Anthropic’s research on internal concept vectors, published in its work on tracing the thoughts of large language models, demonstrated how models like Claude can be trained to learn when not to answer, turning refusal into a learned policy rather than a fragile prompt instruction. That is a genuine architectural advance. But even that research does not claim to eliminate hallucination. It reduces confident wrong answers in cases where the model has learned to recognize its own uncertainty. Cases it has not learned to recognize remain.
Retrieval-augmented generation is not a complete solution either. Even with retrieval grounding, hallucination rates are non-zero. Models can misread retrieved documents, over-generalize from them, or fabricate claims about content that was present but ambiguous. The Karpowicz impossibility proof applies to all architectures, including hybrid ones.
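The grounding idea can be shown without a model at all. The sketch below is a deliberately crude stand-in, assuming a two-document corpus with invented contents and simple word-overlap retrieval in place of a real search index. The "generation" step just quotes the retrieved passage. All names, document IDs, and the overlap threshold are hypothetical.

```python
import re

# Hypothetical verified corpus: in practice these would be documents you control.
DOCUMENTS = {
    "policy-001": "Refund requests must be filed within 30 days of purchase.",
    "policy-002": "Enterprise contracts renew annually unless cancelled in writing.",
}

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, min_overlap=2):
    """Return (doc_id, passage) with the largest word overlap, or None."""
    query_terms = tokenize(query)
    best_id, best_score = None, 0
    for doc_id, passage in documents.items():
        score = len(query_terms & tokenize(passage))
        if score > best_score:
            best_id, best_score = doc_id, score
    if best_score < min_overlap:
        return None
    return best_id, documents[best_id]

def grounded_answer(query, documents):
    """Answer only by quoting a retrieved passage, with its source ID attached."""
    hit = retrieve(query, documents)
    if hit is None:
        return "No supporting document found."
    doc_id, passage = hit
    return f"{passage} [source: {doc_id}]"

print(grounded_answer("How many days do I have to file refund requests?", DOCUMENTS))
print(grounded_answer("What is the CEO's salary?", DOCUMENTS))
```

The design choice that matters is the refusal path: when nothing relevant is retrieved, the system says so instead of producing text. A real pipeline replaces the quoting step with a model that paraphrases the passage, and that paraphrase is where misreading and over-generalization re-enter.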
What leaders should do
The organizational lesson from KPMG and Deloitte is not that AI cannot be used in high-stakes work. It is that the verification infrastructure around AI has to be designed to match the deployment context, and in both cases it was not.
In the Deloitte case, GPT-4o via Azure OpenAI was used as part of the technical workstream for a 237-page independent assurance review of a government welfare compliance system. No organizational contingency existed for validating citation quality before publication. A single Sydney University researcher, reading the report out of professional interest, identified fabrications that had passed through Deloitte’s entire production chain undetected.
The structural requirement this creates for enterprise leaders is straightforward to describe, if not always simple to implement. AI deployment in high-stakes output contexts requires verification infrastructure proportionate to the consequence of a wrong answer. That infrastructure is not a prompt. It is a process.
A document can read coherently and be comprehensively wrong. Coherence review does not catch hallucinations. Citation verification does.
For organizations using AI in document generation, research synthesis, compliance analysis, or any output that will be cited by others, the minimum viable architecture involves grounding generation to verified source documents rather than open-ended generation, span-level citation verification that checks each generated claim against retrieved evidence, and human review structured to catch fabrications rather than overall coherence.
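As a minimal sketch of the citation-verification step, assume a registry of sources that a human has already confirmed exist, and a draft whose citations have been extracted into structured records. The class, the registry contents, and the function name below are all hypothetical; the one real citation is a placeholder, not an actual publication.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Citation:
    title: str
    author: str
    year: int

# Hypothetical registry of sources a reviewer has confirmed against the original.
VERIFIED = {
    Citation("Automated Decision Systems in Public Administration", "A. Example", 2021),
}

def check_citations(draft_citations, verified):
    """Split citations into verified and flagged; a match requires every field to agree."""
    ok = [c for c in draft_citations if c in verified]
    flagged = [c for c in draft_citations if c not in verified]
    return ok, flagged

draft = [
    Citation("Automated Decision Systems in Public Administration", "A. Example", 2021),
    # Real-looking, but the year does not match the verified record.
    Citation("Automated Decision Systems in Public Administration", "A. Example", 2019),
    # Plausible title that exists nowhere in the registry.
    Citation("Welfare Compliance and Machine Reasoning", "B. Sample", 2022),
]

ok, flagged = check_citations(draft, VERIFIED)
print(f"{len(ok)} verified, {len(flagged)} flagged for human review")
```

Matching on every field is deliberate. A citation with a real author and a fabricated title or date is still a fabrication, so a partial match is treated as a failure and sent to a person.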
For procurement and vendor evaluation, the question that is currently underasked is not what this model’s hallucination rate is on general benchmarks, but what its hallucination rate is on the specific type of output being generated, in the specific domain, with the specific verification infrastructure in place. Those are four different questions and only the last one is relevant to operational risk.
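Answering that last question means measuring on your own outputs. A rough sketch, assuming reviewers have checked a sample of in-domain outputs and marked each one as fabricated or not: the function below reports the observed rate with a Wilson score interval, so a small sample is not mistaken for a precise number. The function name and the sample counts are illustrative only.

```python
import math

def hallucination_rate(labels, z=1.96):
    """labels: list of bools, True where a reviewer marked the output as fabricated.

    Returns (rate, low, high), where low/high bound a Wilson score interval
    (z=1.96 corresponds to roughly 95% confidence).
    """
    n = len(labels)
    if n == 0:
        raise ValueError("need at least one reviewed output")
    p = sum(labels) / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)

# Illustrative sample: 50 reviewed outputs, 12 marked as containing a fabrication.
reviewed = [True] * 12 + [False] * 38
rate, low, high = hallucination_rate(reviewed)
print(f"observed {rate:.0%}, plausible range {low:.0%} to {high:.0%}")
```

With only 50 reviewed outputs the interval is wide, which is the point: a vendor's single headline percentage says little about a specific deployment until enough in-domain outputs have been checked.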
The answer to AI hallucinations is not a better model. The research establishes that clearly. The answer is an organization that has designed its AI deployments around the fact that the model will sometimes be wrong, that it will not always know it is wrong, and that the confidence with which it is wrong is indistinguishable, on the surface, from the confidence with which it is right. Building for that reality is not a limitation on what AI can do. It is the condition under which AI can be trusted to do anything at all.
Researched and written by NervNow Editorial.
Sources & method
All claims are traced to primary reporting or published research. KPMG and Deloitte case details sourced from The Register, GPTZero’s published investigation, Swiss Info, Fortune, CFO Dive, and Futurism. Chris Rudge quote via Futurism citing the Associated Press. Edward Tian quote via Swiss Info. PersonQA benchmark figures (o1 16%, o3 33%, o4-mini 48%) via aboutchromebooks.com and axis-intelligence.com; NervNow recommends verifying against OpenAI’s published evaluations. Mitigation reduction figures (RAG 75–90%, tool grounding 65–80%, self-consistency ~65%, prompt engineering ~15%) via digitalapplied.com April 2026 benchmark as cited in axis-intelligence.com. Stanford RegLab legal hallucination range (69–88%) via axis-intelligence.com. ECRI 2026 health hazard ranking and 40 million figure via axis-intelligence.com. $67.4 billion cost figure via AllAboutAI 2025 as reported across multiple sources. Karpowicz (2025) and Xu et al. (2024) mathematical proofs cited via secondary sources including axis-intelligence.com and suprmind.ai; NervNow recommends verifying against the original papers. OpenAI September 2025 paper on hallucination incentive structure via Lakera blog. Anthropic concept vectors research via Lakera blog. Figures may vary across sources or change after publication. To flag a correction, write to editorial@nervnow.com.







