Is Human Oversight of AI Still Possible?
Regulations are being written for the AI of yesterday. The agentic AI of today operates in a space where the very concept of oversight is being redefined — often faster than regulators can type.
In December 2024, we published an editorial in New Biotechnology asking a deceptively simple question: Is human oversight of AI systems still possible?[^1] The paper emerged from our work on the special issue “Artificial Intelligence for Life Sciences,” and what began as a governance question quickly became something more unsettling — a structural diagnosis of where the entire AI oversight project may be breaking down.
The honest answer is this: complete oversight, in the classical sense, is no longer viable. But that is not the same as saying oversight is impossible. It means we need a fundamentally different theory of what oversight is for and how it works.
The Regulation Is Real. The Problem Is Also Real.
The European AI Act is now law. The UN has passed resolutions. Governments are drafting guidelines at a pace that would have seemed remarkable just three years ago. And yet, while this regulatory machinery spins up, the AI landscape it is trying to govern has fundamentally shifted. The black-box deep learning models that worried us in 2020 have given way to large language models capable of generating persuasive text, code, and synthetic data at scale. And now, agentic AI — systems that plan, execute multi-step tasks, and call other AI systems as tools — is moving from research labs into production deployments.
The regulatory frameworks being written today are largely frameworks for the AI of yesterday. The mismatch is not a detail. It is the central problem.

Entering stormy times — Detail from De Zweedse jacht Lejonet op het IJ voor Amsterdam, Ludolf Backhuysen, 1674 — photographed and digitally enhanced by Heimo Müller
What Classical Oversight Assumed
Traditional human oversight of automated systems rested on three assumptions that no longer reliably hold.
It assumed that a human expert could, in principle, understand the system’s decision logic. It assumed that errors would be localized and traceable. And it assumed that the system operated within a defined scope, on defined inputs, at speeds where human review was at least theoretically possible.
Deep learning shattered the first assumption. The “black box” problem is not merely an inconvenience; it is a structural feature of how these systems work. You can probe attention heads, apply SHAP values, run integrated gradients — and still not be able to explain why a model classified this particular biopsy image as malignant, or why it recommended this treatment protocol over that one. Explainable AI (XAI) research has made genuine progress, but it has also repeatedly demonstrated the gap between local approximations and genuine mechanistic understanding.
Generative AI shattered the second assumption. When a system can hallucinate — generate plausible, confident, internally consistent text that is simply false — errors are no longer localized events traceable to a faulty input or a mislabeled training example. They are a probabilistic feature of the system itself, distributed across every output, calibrated against nothing but the statistical texture of training data.
Agentic AI is now shattering the third. A system that can browse the web, write and execute code, call APIs, spin up sub-agents, and iterate toward a goal over dozens of steps is not operating at a speed or scope amenable to real-time human review. The loop has been opened. The human is no longer in it in any meaningful sense.
The Economic Pressure Nobody Talks About Enough
There is another force operating beneath the surface of the regulatory debate that deserves more direct attention: economic pressure.
Human oversight is expensive. It requires trained professionals who can interpret AI outputs in context, catch subtle errors, and maintain attention across long monitoring sessions without succumbing to what we describe in the paper as “audit fatigue.” In high-throughput clinical settings — radiology, pathology, clinical decision support — the economic case for AI is predicated precisely on reducing that human labor.
This creates a structural tension that no regulatory framework has yet resolved cleanly. The EU AI Act mandates meaningful human oversight for high-risk AI systems. But “meaningful” is doing enormous work in that sentence. A radiologist reviewing 400 AI-flagged scans per shift is technically “in the loop.” Whether their review constitutes genuine oversight, or whether automation bias has effectively turned them into a rubber stamp, is a different question — one that regulators have been reluctant to force into the open.
The pressure is even sharper in the agentic AI space, where the value proposition is almost entirely about replacing human judgment in multi-step workflows. If you mandate that a human must review and approve each step of an agentic pipeline, you have largely destroyed the use case. The industry knows this. The regulatory frameworks are still pretending otherwise.
Two Things That Actually Help
We are not pessimists about this. The picture that emerges from a serious analysis is not “oversight is impossible, give up.” It is something more specific and more actionable. There are two approaches that have genuine traction — and both are more demanding than the compliance-checkbox version of oversight that currently dominates policy discussions.
1. Qualitative Monitoring and Risk Handling
The first is a shift from verification to monitoring — from trying to check every output to building systems that can detect when something has gone wrong and respond appropriately. This is a fundamentally different epistemic stance.
Instead of asking “is this output correct?” before it is used, you ask “what patterns of outputs, at population scale, would tell us this system is failing in some systematic way?”
This requires investment in qualitative monitoring infrastructure: outcome tracking linked to AI-assisted decisions, statistical process control adapted for probabilistic systems, red-teaming and adversarial probing as ongoing practice rather than one-time certification. In biotechnology and clinical AI specifically, it requires the kind of post-market surveillance architecture that medical device regulation has developed over decades — applied now to software that updates continuously and behaves differently across patient populations.
Risk handling, in this frame, is not about preventing all errors. It is about building systems where errors are detectable, contained, and correctable — where failure modes are known in advance and mitigated by design rather than by hoping the human reviewer catches everything. This includes architectural choices: interfaces that surface uncertainty explicitly, output formats that flag low-confidence predictions, workflow designs that route edge cases to human review rather than treating all cases identically.
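To make the monitoring stance concrete, here is a minimal sketch of the two design moves described above: tracking outputs at population scale for systematic drift, and routing individual low-confidence cases to human review. The class name, thresholds, and control-band logic are illustrative assumptions, not an implementation from the paper; a production system would use proper statistical process control rather than a fixed tolerance band.

```python
from collections import deque

# Hypothetical sketch (names and thresholds are illustrative, not from the paper).
class PredictionMonitor:
    """Tracks a rolling window of model outputs and checks two conditions:
    (1) drift of the positive-prediction rate outside a control band, and
    (2) individual low-confidence outputs, which are escalated to a human."""

    def __init__(self, baseline_rate, window=100, tolerance=0.15,
                 confidence_floor=0.7):
        self.baseline_rate = baseline_rate  # expected positive rate from validation
        self.window = deque(maxlen=window)  # recent binary predictions
        self.tolerance = tolerance          # allowed deviation from baseline
        self.confidence_floor = confidence_floor

    def observe(self, prediction, confidence):
        """Record one output and return a routing decision for this case."""
        self.window.append(1 if prediction else 0)
        if confidence < self.confidence_floor:
            return "human_review"  # edge case: escalate instead of auto-acting
        return "auto"

    def drifting(self):
        """True once the rolling positive rate has left the control band."""
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.window) / len(self.window)
        return abs(rate - self.baseline_rate) > self.tolerance
```

The point of the sketch is the epistemic shift: `observe` never asks whether a prediction is correct; it only decides how the case is routed, while `drifting` watches the population for patterns no single reviewer could see.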
2. Evidence-Based Approaches and Abstention Mechanisms
The second, and in our view more urgent, approach addresses the hallucination problem directly. For generative AI deployed in high-stakes domains, the absence of a reliable abstention mechanism — a way for the system to say “I don’t know” or “I am not confident enough to answer this” — is not a minor limitation. It is a patient safety issue.
Evidence-based approaches ground generative AI outputs in verifiable, curated sources rather than letting the model generate freely from its parametric memory. Retrieval-Augmented Generation (RAG) is the best-known instantiation: instead of asking the model to recall facts, you provide retrieved documents and ask it to synthesize and reason over them. This does not eliminate hallucination, but it changes the error mode from “confident fabrication” to “misreading a real source” — which is both less dangerous and more detectable.
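A toy sketch of the RAG pattern described above may help. The tiny corpus, the naive keyword-overlap scorer, and the prompt wording are all illustrative assumptions; a real deployment would use vector search over a curated knowledge base and pass the assembled prompt to an actual language model.

```python
# Hypothetical corpus of curated, verifiable source snippets (illustrative).
CORPUS = {
    "doc1": "Trastuzumab targets the HER2 receptor in breast cancer.",
    "doc2": "Metformin is a first-line therapy for type 2 diabetes.",
}

def retrieve(query, corpus, k=1):
    """Rank documents by naive keyword overlap with the query (toy scorer)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus.items(),
                    key=lambda kv: len(q_terms & set(kv[1].lower().split())),
                    reverse=True)
    return [text for _, text in scored[:k]]

def build_prompt(query, corpus):
    """Ground the model: answer only from retrieved evidence, else abstain."""
    evidence = "\n".join(retrieve(query, corpus))
    return ("Answer using ONLY the evidence below. "
            "If it is insufficient, reply 'I don't know.'\n"
            f"Evidence:\n{evidence}\n\nQuestion: {query}")
```

The structural point survives the toy scale: the model is asked to reason over retrieved text rather than recall from parametric memory, so a wrong answer can be checked against a named source instead of against nothing.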
Abstention mechanisms go further. A system that can recognize when a query falls outside its reliable knowledge — and communicate that uncertainty rather than generating a confident answer — is qualitatively safer than one that always produces output. The technical challenges are real; calibrating confidence in LLMs is an active research problem. But the clinical and regulatory imperative is clear.
A diagnostic AI that says “I cannot make a reliable determination from this sample” is more valuable than one that is wrong 15% of the time with full confidence.
Self-monitoring, knowledge base integration, post-hoc fact checking, and RLHF-based correction of hallucinated patterns are all partial implementations of this broader principle. These are not optional quality improvements for generative AI in clinical contexts — they are foundational requirements that should be non-negotiable in any responsible deployment framework.
What This Means for the HMMC Research Agenda
At HMMC, our work on human-machine collaboration is directly shaped by this analysis. The question of oversight is not separable from the question of how human and machine intelligence should be integrated in the first place. An AI system designed from the beginning for auditability — with uncertainty quantification built in and outputs structured to support rather than replace human judgment — is a very different artifact from one designed for maximum autonomous capability and retrofitted with oversight features as an afterthought.
This is not primarily a technical question. It is a design philosophy, and ultimately a values question: what kind of human-AI relationship do we want to build, and what constraints are we willing to accept in exchange for the safety properties that make that relationship trustworthy?
The economic pressures are real. But so is the cost of getting this wrong — in healthcare, in biotechnology, in any domain where AI-assisted decisions have consequences for human lives.
The paper we published in New Biotechnology is a starting point for that conversation. We will continue it here.
This post is part of the HMMC series on human-machine collaboration in high-stakes domains. The views expressed are those of the authors.
Reference
[^1]: Holzinger A, Zatloukal K, Müller H. Is human oversight to AI systems still possible? New Biotechnology. 2025;85:59–62. https://doi.org/10.1016/j.nbt.2024.12.003