The System Hallucination Scale

Not all hallucinations are equal. A generative AI system that invents a citation is a different kind of problem from one that subtly misrepresents a clinical finding. The System Hallucination Scale exists to make that distinction measurable.

The problem with “sometimes wrong”

When AI systems began failing in public — fabricated court cases, invented medical references, confident assertions about events that never happened — the response was predictable: build better benchmarks, measure more dimensions, run more tests.

What that response missed is a more fundamental question: from whose perspective are we measuring?

Existing hallucination benchmarks are largely designed around automatic detection — does the output contradict a ground-truth source? That is a useful signal, but it is not the whole picture. What matters in deployment is not just whether a model produces false content, but how that falseness manifests under realistic interaction conditions — whether it is detectable, how users respond to it, and whether the system adapts when challenged.

These are not questions an automated metric can answer. They require a human-centered instrument.


What the SHS measures

The SHS is a lightweight questionnaire instrument that assesses hallucination-related behavior in large language models from the perspective of a user in realistic interaction. It is explicitly not an automatic hallucination detector or benchmark metric: it captures how hallucination phenomena manifest to the people who actually use the system.

The instrument evaluates four dimensions:

Factual Unreliability

Does the system produce content that is factually incorrect or fabricated? This is the dimension most existing benchmarks target, but the SHS treats it as one factor among several, not the whole story.

Incoherence

Does the output maintain logical and narrative consistency within and across turns? A system can be locally accurate but globally incoherent — a different class of risk that accuracy metrics miss entirely.

Misleading Presentation

Does the system present uncertain or contested information with inappropriate confidence? This captures the framing problem: outputs that are technically not false but are epistemically irresponsible. In clinical or legal contexts, this dimension is often more consequential than outright fabrication.

Responsiveness to User Guidance

When a user identifies an error or pushes back, does the system correct, persist, or capitulate indiscriminately? A system that sycophantically agrees with any correction may be as dangerous as one that stubbornly persists in error. This dimension is critical for high-stakes applications.

Together, these four dimensions produce a profile, not just a score. A system that scores poorly on factual unreliability but well on responsiveness is a different deployment risk from one with the reverse pattern — and should trigger different mitigation strategies.
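To make the profile idea concrete, here is a minimal sketch of how a four-dimension result might be represented and flagged in code. Everything in it — the 1–5 rating convention, the field names, and the threshold — is a hypothetical illustration, not the paper's scoring scheme:

```python
from dataclasses import dataclass

@dataclass
class SHSProfile:
    """Hypothetical 1-5 rating per SHS dimension (higher = worse)."""
    factual_unreliability: float
    incoherence: float
    misleading_presentation: float
    responsiveness: float  # higher = worse responsiveness to user guidance

    def flags(self, threshold: float = 3.0) -> list[str]:
        """Return the dimensions exceeding the threshold, i.e. those
        that would warrant a targeted mitigation strategy."""
        return [name for name, value in vars(self).items() if value > threshold]

# Two systems with the same mean score but opposite risk profiles:
a = SHSProfile(4.5, 1.0, 1.5, 1.0)  # fabricates, but corrects when challenged
b = SHSProfile(1.0, 1.5, 1.0, 4.5)  # rarely wrong, but unresponsive when it is
print(a.flags())  # ['factual_unreliability']
print(b.flags())  # ['responsiveness']
```

The point of the sketch is the last two lines: a single aggregate score would rank these two systems identically, while the per-dimension flags point to different mitigations.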


Why existing benchmarks miss the point

Standard hallucination benchmarks share a common architecture: present a model with prompts that have known correct answers, measure the proportion of correct responses, compare across models.

This approach has two structural limitations.

First, it measures performance under controlled conditions. Real deployment is not controlled. Users ask ambiguous questions, provide misleading context, and interact over multiple turns in ways that no static benchmark anticipates.

Second, it treats hallucination as a binary property — a response is either accurate or it is not. The SHS treats hallucination as a multidimensional phenomenon with variable severity, because that is what it actually is. A fabricated citation in a legal brief and a fabricated citation in a casual recommendation are not the same problem, even if both register identically in an accuracy metric.


Validation

The SHS was validated in a real-world study with 210 participants, demonstrating high clarity, coherent response behavior, and construct validity. Internal consistency was strong (Cronbach’s α = 0.87), with significant inter-dimension correlations confirming that the four dimensions capture related but distinct aspects of hallucination behavior.
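For readers unfamiliar with the statistic: Cronbach's α measures internal consistency from per-item ratings as α = k/(k−1) · (1 − Σσᵢ² / σ²_total), where k is the number of items, σᵢ² the variance of each item, and σ²_total the variance of respondents' total scores. A minimal sketch with made-up toy data (not the study's data):

```python
import statistics

def cronbach_alpha(ratings: list[list[float]]) -> float:
    """Cronbach's alpha for internal consistency.

    ratings: one inner list per respondent, one column per item.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(ratings[0])
    items = list(zip(*ratings))  # transpose to per-item columns
    item_vars = sum(statistics.variance(col) for col in items)
    total_var = statistics.variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - item_vars / total_var)

# Toy data: 4 respondents rating 3 items (illustrative only).
data = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]
print(round(cronbach_alpha(data), 2))  # → 0.96
```

Values above roughly 0.8 are conventionally read as strong internal consistency, which is the sense in which the reported α = 0.87 supports the instrument.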

The instrument is designed to be completed in two to three minutes per interaction session, without requiring technical expertise. This means domain experts — clinicians, legal professionals, analysts — can serve as evaluators directly, rather than relying on AI researchers as proxies.


The interactive SHS calculator

You can apply the scale to your own system using the interactive SHS calculator on this site. Enter your ratings across the four dimensions and receive an interpreted score with guidance on what the profile means for deployment decisions.

The full scoring algorithm, including a Python reference implementation, is available in the paper’s supplementary material.
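As a purely illustrative stand-in for what such a scorer does (the real algorithm is the one in the supplementary material), a calculator of this shape takes the four dimension ratings and maps them to a coarse guidance band. The equal weighting and the band boundaries below are hypothetical:

```python
def interpret_shs(ratings: dict[str, float]) -> str:
    """Map four 1-5 dimension ratings (higher = worse) to a guidance band.

    The equal weighting and band boundaries are illustrative only,
    not the published SHS scoring algorithm.
    """
    mean = sum(ratings.values()) / len(ratings)
    if mean < 2.0:
        return "low risk: suitable for monitored deployment"
    if mean < 3.5:
        return "moderate risk: deploy with human review of outputs"
    return "high risk: unsuitable for high-stakes use without mitigation"

print(interpret_shs({
    "factual_unreliability": 2.0,
    "incoherence": 1.5,
    "misleading_presentation": 3.0,
    "responsiveness": 2.5,
}))  # mean 2.25 falls in the moderate-risk band
```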


Measurement is the prerequisite

We are at a stage in AI deployment where the pressure to ship is enormous and the tools for systematic evaluation are underdeveloped. The SHS does not solve that imbalance, but it shifts the calculus in one important direction: it gives teams a structured, human-centered instrument that does not require a research background to apply.

In high-stakes domains — medicine, law, any field where AI-assisted decisions have consequences for human lives — the question is not whether AI systems hallucinate. They do. The question is whether we understand the character of that hallucination well enough to manage it.

Measurement is not the same as mitigation. But it is the prerequisite.


Reference: Müller H, Steiger D, Plass M, Holzinger A. The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models. arXiv:2603.09989 (2026). https://doi.org/10.48550/arXiv.2603.09989