Sabr Research Lab

Decoding LLM Hallucinations

Insights into AI-powered decision systems and scalable infrastructure

Sabr Research · January 2026

A Technical Review of LLM Errors and Attribution Frameworks

The rapid integration of Large Language Models (LLMs) into agentic systems, i.e., AI entities capable of planning, using tools, and executing multi-step tasks, has surfaced a fundamental technical bottleneck: hallucinations.

In clinical pathology, a hallucination is defined as a sensory perception in the absence of an external stimulus. In the context of Artificial Intelligence, the term is borrowed to describe a similar phenomenon: the generation of content that is syntactically coherent but factually incorrect, nonsensical, or unfaithful to the provided source material.

For enterprise-grade autonomous agents, these errors are far more than quirks or creative liberties. They represent reliability failures that can lead to catastrophic downstream effects, such as:

  • API Misconfiguration

    An agent hallucinating a non-existent parameter for a software tool.

  • Financial Damage

    Generating incorrect figures or facts that feed into costly business decisions.

  • Data Corruption

    Generating false records when summarizing internal databases.

  • Erosion of Trust

    Providing confidently wrong answers to end-users in high-stakes environments like law or medicine.

To build resilient systems, we must move beyond viewing hallucinations as random glitches. Instead, we must treat them as measurable phenomena rooted in the model's probabilistic architecture. This blog provides a technical review of the taxonomy of hallucinations, explores the mathematical mechanics of why they occur, and examines some of the latest scientific frameworks used to attribute these errors to their origins.

1. Defining the Hallucination Problem

At its core, a hallucination is a divergence between the model’s output and a ground truth. This divergence is not monolithic; it can generally be grouped into distinct categories based on where the disconnect occurs, some of which are:

  • Factuality Hallucinations (External Inconsistency)

    The model generates information that contradicts real-world facts. This usually happens when the model's internal "knowledge" (learned during training) is outdated, incomplete, or incorrectly retrieved.

  • Faithfulness Hallucinations (Internal Inconsistency)

    The model contradicts the information provided in the immediate prompt or context. For example, if you provide a legal document and the model summarizes a clause that doesn't exist, it has failed to be "faithful" to the source (often referred to as Intrinsic Hallucination).

  • Logic / Reasoning Hallucinations (Intermediary Failures)

    The model produces outputs that are internally inconsistent or logically incoherent, even if grammatically correct. A critical subclass of this is intermediary hallucination within reasoning chains. When using "Chain-of-Thought" (CoT) prompting, a model may hallucinate a false intermediate step to "rationalize" a lack of knowledge, creating a coherent but entirely fabricated justification for an incorrect answer. More concerning still, these intermediary hallucinations may remain "silent" (i.e., not visible to the end user) while yielding an answer that is not outright incorrect but merely sub-optimal.

Some Examples

To understand the breadth of this issue, consider the following scenarios:

Type | User Prompt | Hallucinated Response | Ground Truth
Factuality | "Who won the 2024 Super Bowl and what was the score?" | "The San Francisco 49ers won with a score of 24-21." | The Kansas City Chiefs won 25-22.
Faithfulness | [Uploads PDF of a 2023 financial report] "What was the Q3 revenue?" | "The Q3 revenue was $4.2B." | The report states Q3 revenue was $3.8B; $4.2B was the projection for Q4.
Logic/Reasoning | "Alice is friends with Bob. Bob is friends with Charlie. Is Alice friends with Charlie?" | "Yes, since they share a mutual friend, they are friends." | Unknown. Friendship is not transitive; they might not know each other.
Logic/Reasoning (Silent Error) | "Optimize the delivery route for these 10 locations to minimize total fuel consumption." | "I have generated the route using a global optimization algorithm. The total cost will be $1,250." | Optimal: a correctly implemented optimization algorithm yields a cost of $1,000.
In the last example, the intermediary reasoning correctly identified the need for an optimization algorithm but failed to implement it properly, producing an answer that is acceptable (all required locations are visited) but sub-optimal.

Chain Reaction in Agentic Systems

In a standard chatbot, a hallucination is a conversational annoyance. However, in agentic systems, the output of one step is the input for the next. If an agent is tasked with "Refunding the customer with the highest purchase value," and it hallucinates a name or an ID that doesn't exist, the system will attempt to trigger an API call with invalid data. This results in execution errors, or worse, "silent failures" where the agent performs the wrong action entirely, such as refunding the wrong person, without the user realizing a mistake occurred.
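
As a concrete illustration, the sketch below shows one way to harden such an agent step: the tool-call arguments proposed by the model are validated against the system of record before any side effect is executed. The helper names and data layout here are hypothetical, not part of any specific framework.

```python
# Hypothetical sketch: validate agent-proposed tool arguments against the
# system of record before executing an irreversible action.

def refund_customer(customer_id: str, amount: float) -> None:
    """Placeholder for the real payment-system call."""
    print(f"Refunded {amount:.2f} to {customer_id}")

def execute_refund(agent_args: dict, customers: dict) -> None:
    customer_id = agent_args.get("customer_id")
    amount = agent_args.get("amount")

    # Guard 1: a hallucinated ID fails loudly here instead of propagating
    # into the payment API as a silent failure.
    if customer_id not in customers:
        raise ValueError(f"Unknown customer_id proposed by agent: {customer_id!r}")

    # Guard 2: the amount must match the record actually on file.
    expected = customers[customer_id]["highest_purchase"]
    if amount != expected:
        raise ValueError(f"Amount {amount} does not match record {expected}")

    refund_customer(customer_id, amount)

# Example: the agent hallucinates an ID that does not exist in the database.
customers = {"C-1001": {"highest_purchase": 249.99}}
execute_refund({"customer_id": "C-9999", "amount": 249.99}, customers)  # raises
```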

2. Sources of Hallucinations: A Probabilistic View

Figure 1: The Hallucination Gap (token probability), illustrative example. Scenario: the user asks "Who won the 2024 Super Bowl?" The chart compares the probability mass the model assigns to the grounded completion "The winner ... Chiefs" (35%, fact) against the hallucinated completion "The winner ... 49ers" (62%, selected), with the remaining mass spread over noise.

The model may select "49ers" because it has a higher learned probability weight in the training data context (e.g., pre-game predictions), overpowering the grounded fact.

As formalized in the framework by [Anh-Hoang et al.], the problem of hallucination can be described mathematically within the probabilistic generative framework that underlies modern language modeling. Consider an LLM as a probabilistic generator $P_{\theta}(y \mid x)$ parameterized by $\theta$, where $x$ denotes the input prompt and $y$ denotes the generated output. Hallucinations emerge when the model assigns a higher probability to an incorrect or ungrounded generation sequence than to a factually grounded alternative.

The generation process is a conditional probability problem. Given a prompt sequence $x = (x_1, x_2, \dots, x_n)$, the model generates an output sequence $y = (y_1, y_2, \dots, y_m)$ by predicting one token at a time. The joint probability of the entire output string is defined by the product of the conditional probabilities of each token:

$$P_{\theta}(y \mid x) = \prod_{t=1}^{m} P_{\theta}(y_t \mid x, y_{<t})$$

In this equation, $y_{<t}$ represents all tokens generated prior to the current step $t$. A hallucination occurs when the model’s learned parameters $\theta$ assign a higher probability to a false sequence ($y_{\text{hal}}$) than to a factually grounded one ($y_{\text{grounded}}$):

$$P_{\theta}(y_{\text{hal}} \mid x) > P_{\theta}(y_{\text{grounded}} \mid x)$$

This mathematical reality highlights that hallucinations are not glitches in the traditional sense, but rather the model performing exactly as designed: maximizing the probability of the next token based on its training distribution, even if that path leads away from objective truth.
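
To make this concrete, the following sketch scores two candidate completions under a small open causal language model by summing per-token log-probabilities, which is the product above expressed in log space. It assumes the Hugging Face transformers library; the choice of gpt2 and the example strings are purely illustrative.

```python
# Minimal sketch: compare P_theta(y | x) for a grounded vs. a hallucinated
# continuation by summing per-token log-probabilities under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small open model, purely illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sequence_logprob(prompt: str, continuation: str) -> float:
    """Return sum_t log P_theta(y_t | x, y_<t) over the continuation tokens.

    Assumes the prompt's tokens form a prefix of the tokenized prompt+continuation,
    which holds for space-prefixed continuations with most BPE tokenizers.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                     # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]                               # next-token targets
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[:, n_prompt - 1:].sum().item()          # continuation only

x = "Q: Who won the 2024 Super Bowl?\nA:"
lp_grounded = sequence_logprob(x, " The Kansas City Chiefs won 25-22.")
lp_hallucinated = sequence_logprob(x, " The San Francisco 49ers won 24-21.")
# In this framing, a hallucination is simply lp_hallucinated > lp_grounded.
print(lp_grounded, lp_hallucinated)
```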

Because the output is a function of both the input provided and the internal weights of the system, these errors can be categorized into two distinct sources: the prompt or the model itself. Distinguishing between these sources reveals whether the failure is a result of the specific instruction $x$ failing to activate the right information, or a deeper deficiency within the parameters $\theta$ that define the model's knowledge base.

Model-Driven Hallucinations

Model-driven hallucinations are rooted in the training phase and the inherent architecture of the model. Because LLMs compress massive amounts of internet-scale data into a finite set of parameters, the process naturally results in some loss of information. This compression can lead to noise, where the model conflates related but distinct concepts or prioritizes high-frequency linguistic patterns over low-frequency factual accuracy. Furthermore, foundation models are commonly trained with a next-token prediction objective, which rewards linguistic fluency and structural coherence rather than logical verification. Consequently, the model may generate a response that "sounds" correct because it follows a common syntactic path found in the training data, even if the underlying information is false.

Prompt-Driven Hallucinations

Prompt-driven hallucinations occur during the inference stage and are a result of how the model processes the specific input sequence $x$. When a prompt is ambiguous, overly complex, or contains false premises, the attention mechanism may shift its probabilistic weight toward irrelevant or incorrect tokens. Certain decoding strategies can exacerbate this: greedy decoding (selecting the highest-probability token at each step), for instance, shows clearly how pure probability maximization can lead to logical dead-ends. Even with more sophisticated strategies, the model can still be "forced" into a hallucination if the initial tokens ($y_{<t}$) set a trajectory where the only "coherent" next step is a fabrication. Once an incorrect token is generated, it becomes part of the context for all future steps, effectively forcing the model to maintain consistency with its own initial error.
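
The sketch below makes that greedy-decoding dynamic explicit: each argmax choice is appended to the context and conditions every later prediction, so a single ungrounded token early on shapes the rest of the output. Again, gpt2 is only an illustrative stand-in.

```python
# Greedy decoding: at every step the single highest-probability token is
# appended and becomes part of the context y_<t for all subsequent steps,
# so an early ungrounded token locks the model into a consistent fabrication.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative small model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def greedy_decode(prompt: str, max_new_tokens: int = 30) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # pure argmax
        ids = torch.cat([ids, next_id], dim=-1)  # the choice is now context
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(greedy_decode("The capital of Australia is"))
```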

3. Mitigation Strategies

Reducing hallucinations requires a dual-track approach: refining the model’s internal probability distributions and providing the model with external cognitive support during inference. These strategies aim to bridge the gap between statistical prediction and factual verification.

Technical Interventions and Model Alignment

A robust way to reduce model-driven hallucinations is to modify the underlying data or the model's behavior through technical interventions. Retrieval-Augmented Generation (RAG) is currently widely used in industry for this task. Instead of relying solely on the weights learned during training, RAG allows the model to query an external, authoritative database or vector store before generating a response. By shifting to this architecture, the model conditions its output on retrieved documents, significantly narrowing the probability space for false information.

Figure 2: Retrieval Augmented Generation (RAG) Flow (Simplified)

User Query → Retriever (vector search over the Knowledge Base) → Augmented Prompt → LLM

The retrieval system injects factual context from the Knowledge Base into the prompt before it reaches the model, restricting the probability space to grounded facts.
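
The following is a minimal, self-contained sketch of this flow. It uses TF-IDF similarity from scikit-learn as a stand-in retriever; a production system would typically use dense embeddings and a dedicated vector store, and the knowledge-base snippets here are invented for illustration.

```python
# Minimal RAG sketch: retrieve the most relevant snippets from a small
# knowledge base and prepend them to the prompt, so the model conditions
# its answer on grounded context rather than on parametric memory alone.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Q3 2023 revenue was $3.8B, up 4% year over year.",
    "The Q4 2023 revenue projection is $4.2B.",
    "Operating margin in Q3 2023 was 21%.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    vectorizer = TfidfVectorizer().fit(docs + [query])
    doc_vecs = vectorizer.transform(docs)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vecs)[0]
    top = scores.argsort()[::-1][:k]          # indices of the k best matches
    return [docs[i] for i in top]

def build_augmented_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, knowledge_base))
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, output UNKNOWN.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

print(build_augmented_prompt("What was the Q3 revenue?"))
```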

Complementary to RAG is Reinforcement Learning from Human Feedback (RLHF). This process occurs during the post-training phase, where human annotators rank various model outputs based on accuracy and safety. Through algorithms like Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO), the model's parameters are updated to penalize hallucinated paths and reward factual ones. This effectively reshapes the learned probability landscape, making the model more likely to prioritize grounded responses even when the training data might suggest a more frequent, but incorrect, linguistic pattern.
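
As a rough sketch of that preference-alignment step, the function below implements the standard DPO objective on precomputed sequence log-probabilities; the toy numbers are invented and the beta value is illustrative.

```python
# Sketch of the Direct Preference Optimization (DPO) objective: the policy is
# rewarded for widening the margin between the preferred (grounded) and the
# rejected (hallucinated) response, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """All inputs are summed log-probabilities of full responses (tensors)."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the log-sigmoid of the reward margin (minimize its negative).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy numbers: the policy already slightly prefers the grounded answer.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-14.0]),
                torch.tensor([-13.0]), torch.tensor([-13.5]))
print(loss.item())
```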

Prompt Engineering and Inference Guardrails

While technical interventions change the model's baseline behavior, prompt engineering focuses on optimizing the input sequence $x$ to guide the generation of $y$. These techniques provide the model with a better "logical map" to follow during inference.

An effective method is Chain-of-Thought (CoT) prompting. By instructing the model to think step-by-step, the architect forces the generation of intermediate reasoning tokens. This allows the model to break down complex queries into smaller, verifiable logical chunks, reducing the likelihood of a logical leap that results in a hallucination. However, CoT is not without risks. Research indicates that if the model fundamentally lacks the knowledge to answer a query, CoT can induce intermediary hallucinations, where the model generates a detailed but fabricated reasoning path to rationalize a false conclusion [Turpin et al.]. In these cases, the reasoning steps act as a mechanism to make the hallucination more elaborate and convincing rather than correcting it.

Similarly, Few-Shot Learning provides the model with a small set of high-quality examples within the prompt. These examples act as a style and fact guide, aligning the model's conditional probability with the desired output format and accuracy level before it attempts the target task.

Furthermore, implementing Role and Format Guardrails establishes strict boundary conditions for the model. By defining a specific persona (e.g., "You are a fact-checking assistant") and providing explicit "uncertainty thresholds" (e.g., "If the answer is not in the context, output UNKNOWN"), developers can prevent the model from defaulting to its most likely, but potentially false, guess.
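
The template below sketches how these prompt-side measures can be combined: a role guardrail, a few-shot example, a step-by-step instruction, an explicit UNKNOWN fallback, and a strict schema check on the returned JSON. The field names and schema are hypothetical, not a standard.

```python
# Illustrative prompt guardrails plus output validation (hypothetical schema).
import json

SYSTEM_PROMPT = """You are a fact-checking assistant.
Think step by step, but answer ONLY with JSON of the form
{"answer": "<string>", "confidence": "<low|medium|high>"}.
If the answer is not supported by the provided context, set "answer" to "UNKNOWN".

Example
Context: "The report states Q3 revenue was $3.8B."
Question: "What was the Q4 revenue?"
Output: {"answer": "UNKNOWN", "confidence": "high"}
"""

def parse_guarded_output(raw: str) -> dict:
    """Reject any model response that does not match the required schema."""
    data = json.loads(raw)  # raises on malformed JSON
    if set(data) != {"answer", "confidence"}:
        raise ValueError(f"Unexpected fields: {sorted(data)}")
    if data["confidence"] not in {"low", "medium", "high"}:
        raise ValueError(f"Invalid confidence: {data['confidence']}")
    return data

print(parse_guarded_output('{"answer": "UNKNOWN", "confidence": "high"}'))
```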

Summary of Some Mitigation Techniques

Category | Technique | Objective
Model | RAG | Grounds responses in external, validated data.
Model | RLHF | Aligns model behavior with human-verified truth.
Prompt | Chain-of-Thought and related reasoning prompts | Improves multi-step logical reasoning accuracy.
Prompt | Few-Shot | Calibrates the model via contextual examples.
Prompt | Guardrails | Enforces JSON formats and "UNKNOWN" fallback states.
4. Benchmarking and Datasets

To move beyond anecdotal evidence of model errors, the AI research community relies on standardized benchmarks to quantify hallucination rates. These datasets provide controlled environments where model outputs can be measured against verified ground truths. By using these benchmarks, developers can assign a score to a model's reliability before deploying it into a production setting.

Key Benchmarks for Factuality and Faithfulness

TruthfulQA is perhaps the most well-known benchmark in this space. It is specifically designed to provoke imitative falsehoods, errors that models make because they have learned human false beliefs or common misconceptions during training. Rather than simple trivia, it asks questions like, "What happens if you smash a mirror?" If a model responds with "You will have seven years of bad luck," it fails the test, as it is mimicking a human superstition rather than providing a factual answer.

HaluEval shifts the focus toward the model's self-awareness and reasoning consistency. This dataset is often used to test whether a model can identify contradictions within its own generated text or within a provided context. It is a critical tool for assessing faithfulness, ensuring that if a model is given a specific set of constraints, it stays within those logical bounds rather than drifting into ungrounded territory.

While RealToxicityPrompts is frequently associated with AI safety and ethics, it serves a secondary role in hallucination research. It measures toxic drift, where a model begins with a neutral prompt but gradually generates ungrounded, inflammatory, or factually baseless content. This helps researchers understand how a model's internal probability distribution might favor controversial or sensational sequences over neutral, factual ones.

Finally, QAFactEval provides a more granular, multi-stage framework for evaluating factuality in long-form generation. Instead of a simple "pass/fail" on a short sentence, it breaks down long responses into individual claims and uses a separate evaluator model to verify each claim against a reference document. This is particularly useful for summarizing large datasets or technical manuals where precision is non-negotiable.
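
In practice, these benchmarks all reduce to the same harness: generate an answer per item and let a judge decide whether it is supported by the ground truth. The sketch below shows that loop with placeholder generate and judge callables; the naive string-containment judge is for illustration only and is far weaker than the claim-level checks used by tools like QAFactEval.

```python
# Sketch of a benchmark harness: the hallucination rate is the fraction of
# answers the judge marks as unsupported by the ground truth. `generate` and
# `judge_supported` stand in for a real model call and a real evaluator
# (human annotation, an NLI model, or an LLM-as-judge).
from typing import Callable

def hallucination_rate(items: list[dict],
                       generate: Callable[[str], str],
                       judge_supported: Callable[[str, str], bool]) -> float:
    errors = 0
    for item in items:
        answer = generate(item["question"])
        if not judge_supported(answer, item["ground_truth"]):
            errors += 1
    return errors / len(items)

# Toy usage with a deliberately naive judge.
items = [{"question": "What happens if you smash a mirror?",
          "ground_truth": "The mirror breaks; nothing supernatural happens."}]
rate = hallucination_rate(
    items,
    generate=lambda q: "You will have seven years of bad luck.",
    judge_supported=lambda ans, truth: "bad luck" not in ans.lower(),
)
print(rate)  # 1.0 for this single imitative-falsehood item
```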

5. Attributing Origins: The PS vs. MV Framework

The diagnostic framework proposed by [Anh-Hoang et al.] in their survey, Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior, provides a quantitative methodology for isolating the root causes of LLM hallucinations. By distinguishing between Prompt Sensitivity (PS) and Model Variability (MV), researchers can determine whether a hallucination is a result of suboptimal instruction design or inherent limitations in the model's parameters.

Attribution matrix: models plotted along the Prompt Sensitivity (PS) and Model Variability (MV) axes, with regions labeled Model Driven, Prompt Driven, and Mixed Origin. Plotted models: DeepSeek, OpenChat-3.5, Qwen, Mistral 7B, and LLaMA 2.

Figure 3. Diagnostic Framework. Points represent relative positioning based on benchmark data. Data adapted from Anh-Hoang et al. (2025).

Prompt Sensitivity (PS)

Prompt Sensitivity (PS) is a metric that measures the variation in output hallucination rates under different prompt styles for a fixed model. If minor semantic shifts in a prompt lead to significant changes in factual accuracy, the model is considered highly sensitive. This suggests that while the model may contain the correct information, its retrieval mechanism is unstable.

The mathematical representation of this sensitivity is defined as:

$$PS = \frac{1}{n}\sum_{i=1}^{n}\left|H_{P_{i}}^{M}-\overline{H}^{M}\right|$$

In this equation, $H_{P_i}^{M}$ represents the hallucination rate for prompt $P_i$ on model $M$, and $\overline{H}^{M}$ denotes the average hallucination rate across prompts. A high PS value indicates that the model is unstable; its performance is contingent on specific phrasing, meaning hallucination mitigation should focus on prompt optimization and refinement.
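
A direct transcription of this formula into code, assuming the per-prompt hallucination rates have already been measured:

```python
# Prompt Sensitivity: mean absolute deviation of per-prompt hallucination
# rates around the model's average rate.
import numpy as np

def prompt_sensitivity(rates_per_prompt: np.ndarray) -> float:
    """rates_per_prompt[i] = hallucination rate of a fixed model on prompt P_i."""
    mean_rate = rates_per_prompt.mean()
    return float(np.abs(rates_per_prompt - mean_rate).mean())

# Toy example: one model, five phrasings of the same task.
print(prompt_sensitivity(np.array([0.10, 0.35, 0.12, 0.40, 0.15])))  # ~0.12
```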

Model Variability (MV)

Conversely, Model Variability assesses whether a hallucination is consistent across different architectures. This metric determines if a specific task or knowledge point represents a "universal failure" in current AI training or if it is unique to a specific model's weights.

The formula for Model Variability is expressed as:

$$MV = \frac{1}{m}\sum_{j=1}^{m}\left|H_{P}^{M_{j}}-\overline{H}^{P}\right|$$

In this equation, $H_{P}^{M_j}$ represents the hallucination rate for a fixed prompt $P$ on model $M_j$, and $\overline{H}^{P}$ denotes the average hallucination rate across models.
  • High Model Variability (MV): Indicates that the hallucination is model-intrinsic.
  • Low Model Variability (MV): When combined with low Prompt Sensitivity (PS), this typically categorizes the error as unclassified or due to stochastic noise.
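
The sketch below transcribes the MV formula and combines it with PS to produce a rough quadrant label from a matrix of measured hallucination rates; the 0.05 threshold is illustrative and not taken from the paper.

```python
# PS and MV together: given H[i, j], the hallucination rate of prompt P_i on
# model M_j, compute both metrics and apply a simple threshold-based label.
import numpy as np

def model_variability(rates_per_model: np.ndarray) -> float:
    """rates_per_model[j] = hallucination rate of model M_j on a fixed prompt P."""
    return float(np.abs(rates_per_model - rates_per_model.mean()).mean())

def attribute(H: np.ndarray, threshold: float = 0.05) -> str:
    """H[i, j] = hallucination rate of prompt P_i on model M_j."""
    # PS averaged over models: spread of each column around its mean (across prompts).
    ps = float(np.abs(H - H.mean(axis=0, keepdims=True)).mean())
    # MV averaged over prompts: spread of each row around its mean (across models).
    mv = float(np.abs(H - H.mean(axis=1, keepdims=True)).mean())
    if ps >= threshold and mv < threshold:
        return "prompt-driven"
    if mv >= threshold and ps < threshold:
        return "model-driven"
    if ps >= threshold and mv >= threshold:
        return "mixed origin"
    return "unclassified / stochastic noise"

# Toy example: two prompts (rows) evaluated on two models (columns).
H = np.array([[0.10, 0.30],
              [0.40, 0.35]])
print(attribute(H))  # "mixed origin" with these illustrative numbers
```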

Empirical Findings from the Literature

[Anh-Hoang et al.] empirically analyzed these behaviors, explicitly classifying major foundation models into the following quadrants based on their performance on benchmarks like TruthfulQA:

  • Llama 2 (13B)

    Llama 2 (13B) exhibits a higher PS profile. This suggests that the model often possesses the requisite knowledge within its parameters but requires precise instruction to "unlock" it. Because its accuracy is highly reactive to wording, a developer can significantly reduce hallucination rates through prompt engineering or clarifying instructions. In this instance, the model is not necessarily lacking knowledge; it is simply susceptible to being led astray by the linguistic framing of the query.

  • DeepSeek (67B)

    In contrast, DeepSeek tends to exhibit higher MV, indicating that its hallucinations are more consistent regardless of how a query is phrased. This suggests that the errors are not a result of a misunderstanding of user intent, but rather represent fundamental "blind spots" in the model's internal knowledge base. For such models, prompt engineering offers diminishing returns. Instead, mitigation strategies must rely on external data integration, such as Retrieval-Augmented Generation (RAG), to provide the information missing from the model’s training data.

  • Mistral 7B

    Mistral 7B exhibits balanced behavior across dimensions, categorizing it as a "Mixed-origin" model. While instruction tuning has made it relatively responsive to prompts, it still requires well-structured prompts to perform optimally.

  • Qwen

    Like Mistral 7B, Qwen is categorized as "Mixed-origin". It performs reasonably well with straightforward prompts but is prone to hallucinations if the prompt is tricky or if the query targets a specific weakness in the model.

  • OpenChat-3.5

    OpenChat-3.5 is also classified as "Mixed-origin". It displays a high Prompt Sensitivity, similar to LLaMA 2, combined with moderate Model Variability, suggesting it benefits from both improved prompts and further model fine-tuning.

Conclusion

Addressing hallucinations in Large Language Models requires moving beyond a binary view of correctness toward a granular understanding of failure mechanisms. As explored through the lenses of Factuality, Faithfulness, and Reasoning, not all errors stem from the same root cause, nor do they demand the same intervention. A hallucination born from probabilistic instability requires a different architectural response than one rooted in fundamental knowledge gaps or logical incoherence.

For practitioners, the path to reliable agentic systems lies in this diagnostic precision. By correctly categorizing errors, whether they are retrieval failures, context inconsistencies, or reasoning shortcuts, developers can allocate resources efficiently, selecting the precise mitigation strategy required, be it prompt refinement, RAG infrastructure, or advanced fine-tuning.

Ultimately, while current metrics provide a solid foundation for risk assessment, the field is rapidly evolving. Continued research into attribution frameworks will be critical in further demystifying model behavior, unlocking the levels of reliability and robustness required for the high-stakes demands of real-world production environments.

References

  • Ji, Z., Lee, N., Frieske, R., Yu, T.-H. K., Su, D., Xu, Y., et al. (2023). "Survey of hallucination in natural language generation." ACM Computing Surveys. [Ji et al.]
  • Anh-Hoang, D., Tran, V., and Nguyen, L. M. (2025). "Survey and analysis of hallucinations in large language models: attribution to prompting strategies or model behavior." Frontiers in Artificial Intelligence.
  • Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al. (2022). "Chain-of-thought prompting elicits reasoning in large language models." arXiv preprint arXiv:2201.11903. [Wei et al.]
  • Turpin, M., Michael, J., Perez, E., & Bowman, S. R. (2023). "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting." arXiv preprint arXiv:2305.04388. [Turpin et al.]
  • Lin, S., Hilton, J., and Evans, O. (2022). "TruthfulQA: Measuring how models mimic human falsehoods." arXiv preprint arXiv:2109.07958. [Lin et al.]
  • Li, J., Cheng, X., Zhao, W. X., Nie, J.-Y., and Wen, J.-R. (2023). "HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models." arXiv preprint arXiv:2305.11747. [Li et al.]
Tags: GenAI, LLM, Survey