The Subjective Hallucination Scale

Not all AI errors are equal. Some are harmless mistakes; others are dangerous confabulations that erode trust and lead to poor decisions.

At HMMC, we’ve developed the Subjective Hallucination Scale (SHS) to address one of the most pressing challenges in human-AI collaboration: distinguishing honest uncertainty from dangerous overconfidence in AI-generated content. By taking a human-centered approach to evaluating AI outputs, the framework helps teams calibrate trust and make better decisions when working with AI systems.


The hallucination problem

Modern AI systems can produce outputs that are factually incorrect yet presented with high confidence; misleading through selective information or framing; confabulatory, fabricating details that merely seem plausible; or contextually inappropriate for the specific use case. These different types of errors have vastly different impacts on human users and on decision-making quality.

Traditional evaluation metrics like accuracy, precision, and recall miss these crucial nuances. They don’t capture the cognitive impact on human users or the trust implications of different types of errors. A system that confidently provides wrong information is far more dangerous than one that hesitantly provides correct information, yet traditional metrics treat them similarly.
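
The gap is easy to see with a toy comparison. The sketch below (a minimal illustration, not part of the SHS itself) contrasts plain accuracy with a penalty that weights each error by the confidence with which it was asserted; both example systems and their numbers are invented.

```python
# Two hypothetical systems with identical accuracy but very different risk
# profiles once errors are weighted by the confidence they were stated with.

def plain_accuracy(outputs):
    """Fraction of outputs that are correct, ignoring stated confidence."""
    return sum(o["correct"] for o in outputs) / len(outputs)

def confidence_weighted_penalty(outputs):
    """Average penalty where a wrong answer costs its stated confidence,
    so a confidently wrong output hurts far more than a hedged wrong one."""
    return sum(o["confidence"] for o in outputs if not o["correct"]) / len(outputs)

overconfident = [
    {"correct": True,  "confidence": 0.90},
    {"correct": True,  "confidence": 0.90},
    {"correct": False, "confidence": 0.95},  # confidently wrong
]
hedged = [
    {"correct": True,  "confidence": 0.70},
    {"correct": True,  "confidence": 0.70},
    {"correct": False, "confidence": 0.30},  # wrong, but clearly hedged
]

print(plain_accuracy(overconfident), plain_accuracy(hedged))        # 0.667 vs 0.667
print(confidence_weighted_penalty(overconfident),
      confidence_weighted_penalty(hedged))                          # ~0.32 vs 0.10
```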

The real challenge lies in understanding how AI outputs affect human cognition, decision-making, and trust calibration. We need evaluation frameworks that capture not just what AI systems produce, but how those outputs influence human behavior and judgment.


The SHS framework

Our Subjective Hallucination Scale evaluates AI outputs across four integrated dimensions that capture the full spectrum of AI-human interaction quality:

1. Factual Accuracy: Is the information correct?

Factual accuracy forms the foundation of trustworthy AI systems, but it’s more complex than simple right-or-wrong evaluation. This dimension examines verification against ground truth sources, citation quality and source reliability, temporal accuracy and currency of information, and domain-specific factual validation.

The key insight is that factual accuracy isn’t just about correctness—it’s about the reliability and verifiability of the information provided. AI systems should not only be correct but should also provide clear signals about the confidence and source of their information.

2. Confidence Calibration: Does confidence match accuracy?

Confidence calibration is perhaps the most critical dimension for human-AI collaboration. This dimension evaluates overconfidence detection, uncertainty acknowledgment, appropriate hedging language, and confidence interval accuracy.

Well-calibrated AI systems provide confidence signals that accurately reflect their actual performance. This means users can develop appropriate trust levels and make better decisions about when to rely on AI outputs versus when to seek additional verification or human judgment.
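
One common way to quantify whether stated confidence tracks actual accuracy is a bucketed calibration check such as expected calibration error. The sketch below assumes each logged output carries a stated confidence in [0, 1] and a verified correct/incorrect label; the sample records are invented for illustration.

```python
def expected_calibration_error(records, n_bins=10):
    """records: iterable of (stated_confidence, was_correct) pairs.

    Buckets outputs by stated confidence and returns the weighted average
    gap between mean confidence and observed accuracy per bucket.
    """
    records = list(records)
    bins = [[] for _ in range(n_bins)]
    for conf, correct in records:
        idx = min(int(conf * n_bins), n_bins - 1)  # bucket by stated confidence
        bins[idx].append((conf, correct))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / len(records)) * abs(avg_conf - accuracy)
    return ece

# Hypothetical log: one high-confidence miss, reasonable confidences elsewhere.
records = [(0.95, False), (0.90, True), (0.80, True), (0.60, True), (0.55, False)]
print(f"ECE = {expected_calibration_error(records):.2f}")
```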

3. Contextual Appropriateness: Is this the right response for this situation?

Contextual appropriateness examines whether AI outputs are suitable for the specific situation and user needs. This includes task alignment and relevance, user intent understanding, appropriate level of detail, and cultural and situational sensitivity.

The same factual information can be appropriate or inappropriate depending on the context. A detailed technical explanation might be perfect for an expert audience but overwhelming for novices. Contextual appropriateness ensures that AI systems provide outputs that are not just correct, but also useful and appropriate for their intended audience and use case.

4. Cognitive Impact: How does this affect human decision-making?

Cognitive impact focuses on how AI outputs influence human cognition, decision-making, and learning. This dimension measures trust calibration effects, decision quality impact, metacognitive reflection support, and learning and adaptation facilitation.

This dimension recognizes that AI systems don’t just provide information—they shape how humans think, learn, and make decisions. Effective AI systems should enhance human cognitive capabilities rather than replace or diminish them.
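
To make the four dimensions concrete, here is one way a single rating could be recorded and combined. The 1-to-5 scale, field names, and equal default weighting are illustrative assumptions, not the framework’s defined scoring scheme.

```python
from dataclasses import dataclass

@dataclass
class SHSRating:
    """Hypothetical per-output rating across the four SHS dimensions (1-5)."""
    factual_accuracy: int            # 1 = mostly wrong .. 5 = verified correct
    confidence_calibration: int      # 1 = badly miscalibrated .. 5 = well calibrated
    contextual_appropriateness: int  # 1 = wrong for the situation .. 5 = well suited
    cognitive_impact: int            # 1 = degrades decisions .. 5 = supports them

    def composite(self, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
        scores = (self.factual_accuracy, self.confidence_calibration,
                  self.contextual_appropriateness, self.cognitive_impact)
        return sum(w * s for w, s in zip(weights, scores))

rating = SHSRating(factual_accuracy=4, confidence_calibration=2,
                   contextual_appropriateness=5, cognitive_impact=3)
print(rating.composite())  # 3.5
```

A weighted composite like this is only a summary; a low score on any single dimension, particularly confidence calibration, is arguably worth flagging on its own.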


Measuring trustworthiness from the user’s perspective

The SHS goes beyond technical metrics to capture user-centered trustworthiness by focusing on how AI outputs actually affect human users in practice. This includes comprehensibility measures, which assess whether users can understand what the AI is saying; actionability evaluation, which determines whether outputs support effective decision-making; calibration assessment, which tracks whether users develop appropriate trust levels; and recovery evaluation, which measures whether users can detect and correct AI errors.
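
As a sketch of how such user-centered signals might be aggregated, the function below assumes post-task feedback is logged as 1-to-5 ratings plus flags recording whether the AI erred and whether the user caught it; all field names are hypothetical.

```python
from statistics import mean

def summarize_user_feedback(responses):
    """responses: non-empty list of dicts with 1-5 ratings and error flags."""
    summary = {
        "comprehensibility": mean(r["understood_output"] for r in responses),
        "actionability": mean(r["could_act_on_output"] for r in responses),
        "calibration": mean(r["trust_matched_quality"] for r in responses),
    }
    # Recovery only makes sense for sessions in which the AI actually erred.
    erroneous = [r for r in responses if r["ai_made_error"]]
    summary["recovery_rate"] = (
        sum(r["user_caught_error"] for r in erroneous) / len(erroneous)
        if erroneous else None
    )
    return summary
```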

This user-centered approach recognizes that AI systems must be evaluated not just on their technical performance, but on their ability to enhance human capabilities and support effective human-AI collaboration. The goal is to create AI systems that earn and maintain human trust through appropriate confidence and transparent communication about limitations.


Practical applications

We apply SHS evaluation across diverse domains: clinical decision support systems, where accuracy and confidence calibration are critical for patient safety; financial analysis tools, where overconfidence can lead to costly mistakes; educational content generation, where contextual appropriateness determines learning effectiveness; and operational planning assistance, where cognitive impact affects decision quality.

The framework helps organizations calibrate user expectations about AI capabilities by providing clear signals about system performance and limitations. It supports the design of better interfaces that expose uncertainty appropriately, enabling users to make informed decisions about when to trust AI outputs. The SHS also facilitates the development of training programs for human-AI collaboration by identifying specific areas where users need support in working effectively with AI systems.

Additionally, the framework enables organizations to monitor system performance from a user perspective, providing insights that go beyond traditional technical metrics to capture the actual impact of AI systems on human users and decision-making quality.


Implementation insights

Key lessons from SHS deployment across various domains reveal important patterns in human-AI interaction. Context matters significantly—the same output can be appropriate or inappropriate depending on the specific use case, user expertise level, and situational requirements. This highlights the importance of designing AI systems that can adapt their outputs to different contexts and user needs.

User expertise varies dramatically, and this variation affects how AI outputs should be presented and explained. Novices and experts need different types of explanations, different levels of detail, and different confidence signals. Effective AI systems must recognize and adapt to these differences in user expertise and needs.
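
One lightweight way to act on this is a per-expertise presentation profile, as sketched below. The three profiles and their settings are illustrative assumptions about what might vary, not a prescribed configuration.

```python
# Hypothetical presentation profiles keyed by self-reported user expertise.
PRESENTATION_PROFILES = {
    "novice":       {"detail": "summary",  "jargon": False, "confidence_display": "plain words"},
    "intermediate": {"detail": "standard", "jargon": True,  "confidence_display": "percentage"},
    "expert":       {"detail": "full",     "jargon": True,  "confidence_display": "interval + sources"},
}

def presentation_settings(user_expertise: str) -> dict:
    # Fall back to the most cautious profile when expertise is unknown.
    return PRESENTATION_PROFILES.get(user_expertise, PRESENTATION_PROFILES["novice"])
```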

Trust is dynamic and should evolve with user experience. New users may be overly trusting or overly skeptical, but their confidence levels should calibrate over time as they gain experience with AI systems. The SHS helps track this calibration process and identify when additional support or training might be needed.
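
Tracking that calibration can be as simple as comparing how often users accept AI suggestions with how often those suggestions turn out to be correct over a recent window. The sketch below assumes an interaction log of (accepted, correct) pairs; the window size is arbitrary.

```python
def trust_gap(history, window=50):
    """history: non-empty list of (user_accepted: bool, ai_was_correct: bool).

    Returns acceptance rate minus observed accuracy over the recent window:
    a persistently positive gap suggests over-trust, a negative gap under-trust.
    """
    recent = history[-window:]
    acceptance = sum(accepted for accepted, _ in recent) / len(recent)
    accuracy = sum(correct for _, correct in recent) / len(recent)
    return acceptance - accuracy
```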

Recovery is crucial for maintaining trust and effectiveness. AI systems will make errors, and users need to be able to detect and correct these errors effectively. Systems need graceful degradation capabilities and clear error correction mechanisms that help users recover from AI mistakes without losing confidence in the overall system.


Building trustworthy AI systems

The Subjective Hallucination Scale helps organizations build AI systems that earn and maintain human trust—not through perfect accuracy, but through appropriate confidence and transparent communication about limitations. By focusing on user-centered evaluation and the cognitive impact of AI outputs, the SHS provides a framework for creating AI systems that enhance rather than replace human judgment.

The key insight is that trustworthy AI systems are those that help humans make better decisions by providing appropriate confidence signals, clear explanations, and graceful error recovery mechanisms. This requires ongoing evaluation and improvement based on how AI outputs actually affect human users in practice.


