Andrew Napier
Preventing AI Fabrication in Clinical Documentation
How I think about keeping physician-facing documentation systems from quietly making things up and losing trust.
Fabrication in clinical AI is not an abstract research problem. It is a product safety problem.
If your system invents one medication, one order, or one physical exam finding, trust drops to zero.
That is not rhetorical. That is how physicians actually respond when they catch the system making something up. The tolerance for fluent dishonesty is extremely low, and it should be.
This is the stack I care about when the job is physician-facing documentation.
The reason I care so much about this is simple. Most people overfocus on whether the note sounds good. That is the wrong metric. The note can sound excellent and still be clinically unsafe.
In fact, fluent unsafe output is often worse than clumsy unsafe output. At least the clumsy version advertises that something is off. The polished version can slide past a tired reviewer if the product has trained them to relax.
That is why I do not find “it reads really well” especially impressive. Reading well is cheap. Remaining faithful under pressure is the harder thing.
1) Decompose the Task
Monolithic prompts fail in ways that are hard to inspect. I break chart generation into smaller agents so each stage has a narrow job and clear constraints.
Decomposition gives you:
- Smaller reasoning surfaces per step
- Better observability when something goes wrong
- Isolated rollback options without full pipeline disruption
It also forces discipline. Once each stage has a narrow job, it becomes harder for the system to hide bad behavior inside a big polished paragraph.
That matters because fabrication often enters through ambiguity. If the task is too broad, the model starts filling gaps. If the task is narrow, the model has fewer excuses to improvise.
It also makes internal conversations more honest. When the pipeline is decomposed, you can stop having vague debates about whether the system is “doing better” and start looking at where it is actually failing. That is a much healthier way to build.
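As a concrete illustration, here is a minimal sketch of what a decomposed pipeline can look like. The stage names, field names, and `StageResult` shape are all hypothetical, not from any real system; the point is that each stage has one narrow job and records what it did, so you can inspect and roll back per stage.

```python
from dataclasses import dataclass, field

@dataclass
class StageResult:
    stage: str
    output: dict
    flags: list = field(default_factory=list)

def extract_findings(chart: dict) -> StageResult:
    # Extraction stage: copy facts out of the source chart, nothing else.
    findings = chart.get("exam_findings", [])
    return StageResult("extract", {"findings": findings})

def validate_findings(result: StageResult, chart: dict) -> StageResult:
    # Validation stage: every finding must appear in the source chart.
    source = set(chart.get("exam_findings", []))
    kept = [f for f in result.output["findings"] if f in source]
    dropped = [f for f in result.output["findings"] if f not in source]
    return StageResult("validate", {"findings": kept},
                       [f"unsupported: {f}" for f in dropped])

def assemble_note(result: StageResult) -> StageResult:
    # Assembly stage: formatting only; no new clinical content allowed here.
    text = "; ".join(result.output["findings"]) or "[no documented findings]"
    return StageResult("assemble", {"text": text})

chart = {"exam_findings": ["lungs clear to auscultation", "no peripheral edema"]}
note = assemble_note(validate_findings(extract_findings(chart), chart))
```

If a fabricated finding ever appears in the final note, the per-stage results tell you whether extraction invented it or validation let it through, instead of leaving you to stare at one big opaque prompt.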
2) Enforce Extraction-Only Defaults
Most sections should start with an extraction-first rule. If it is not in the source context, do not write it.
That sounds obvious, yet plenty of systems still fail here because they chase smooth prose instead of faithful output.
That tradeoff is one of the main reasons I am skeptical of a lot of clinical documentation demos. They are often built to look impressive in a product video, not to survive a hostile chart audit.
The product video version of the note is usually smooth, concise, and reassuring. The real question is whether it survives a physician who goes line by line and asks, “Where exactly did this come from?” That is a much uglier test. It is also the only one I care about.
I think more teams should force themselves to live in that uglier test. It clears out a lot of self-deception very quickly.
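The extraction-first default can be stated in a few lines of code. This is a toy sketch with an invented `render_medications` helper and an assumed source schema; the rule it encodes is the one above: if it is not in the source context, do not write it.

```python
# Hypothetical extraction-only renderer: a section is built solely from
# fields present in the source context. A missing field stays blank rather
# than being filled with plausible prose.

def render_medications(source: dict) -> str:
    meds = source.get("medications")
    if not meds:
        return ""  # not in the source context -> not in the note
    return "\n".join(f"- {m['name']} {m['dose']}" for m in meds)

source = {"medications": [{"name": "lisinopril", "dose": "10 mg daily"}]}
# render_medications(source) returns the one documented medication;
# render_medications({}) returns an empty section, not a guess.
```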
3) Add Evidence Grounding Gates
Before downstream assembly, candidate statements should map back to source evidence.
If no evidence exists, the statement is removed or flagged.
This removes a lot of bad output by itself. More than most teams want to admit.
It also gives you a cleaner internal argument about what the system is allowed to do. If a claim cannot be tied to source evidence, then it is not a candidate fact. That should not be controversial. In practice, it still is.
It should also shape product culture. Once the team accepts that unsupported facts are simply not eligible output, a lot of bad downstream debates disappear. You stop arguing about whether the sentence was “reasonable.” You ask whether it was supported.
That change in culture is underrated. Without it, the team starts drifting toward justification. With it, the team stays oriented around evidence.
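A grounding gate can be as simple as the sketch below. It assumes the generator emits each candidate statement alongside the source span it claims to be quoting; the function and variable names are illustrative, not from any real library.

```python
# Evidence-grounding gate: a statement survives only if its claimed source
# span actually appears in the source text. No evidence -> not a candidate fact.

def grounding_gate(candidates, source_text):
    kept, flagged = [], []
    for statement, claimed_span in candidates:
        if claimed_span and claimed_span in source_text:
            kept.append(statement)
        else:
            flagged.append(statement)
    return kept, flagged

source = "Patient denies chest pain. Started metformin 500 mg BID."
candidates = [
    ("On metformin 500 mg BID", "metformin 500 mg BID"),
    ("History of CHF", None),  # generator could not point at any evidence
]
kept, flagged = grounding_gate(candidates, source)
```

Real systems need fuzzier span alignment than exact substring match, but the shape of the argument is the same: the gate asks "where exactly did this come from?" before assembly ever sees the sentence.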
4) Validate High-Risk Terms
I maintain explicit checks for high-risk entities such as:
- Medications
- Allergies
- Procedures
- Critical diagnoses
- Orders
For these classes, near-match is not good enough. I use strict or tightly bounded matching.
The closer the term is to an order, diagnosis, medication, or procedure, the less I want “pretty close” behavior. Clinical systems get in trouble when they treat semantic similarity as permission.
This is another place where general-purpose NLP instincts can mislead people. In many domains, approximate meaning is good enough. In clinical documentation, approximate meaning can create a false diagnosis, a false action, or a false legal story.
People who have not spent time inside clinical workflows tend to underestimate that. They hear “semantic similarity” and think they are hearing intelligence. Sometimes what they are actually hearing is risk.
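Here is a sketch of what strict matching for high-risk classes can look like. It assumes the note has already been tagged with `(entity_class, surface_form)` pairs; the cue for this section is that the comparison is exact after light normalization, with no embedding similarity and no fuzzy threshold.

```python
# Strict validation for high-risk entity classes: the surface form must match
# a source-chart entry exactly after whitespace/case normalization.

HIGH_RISK = {"medication", "allergy", "procedure", "diagnosis", "order"}

def normalize(term: str) -> str:
    return " ".join(term.lower().split())

def validate_entities(entities, source_terms):
    source = {normalize(t) for t in source_terms}
    violations = []
    for entity_class, surface in entities:
        if entity_class in HIGH_RISK and normalize(surface) not in source:
            violations.append((entity_class, surface))
    return violations

source_terms = ["warfarin 5 mg", "appendectomy"]
entities = [("medication", "Warfarin 5 mg"),   # exact match after normalization
            ("medication", "warfarin 10 mg")]  # "pretty close" is a violation
```

Note that "warfarin 10 mg" would sail through any semantic-similarity check. That is exactly the behavior this gate exists to refuse.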
5) Preserve Modality and Negation
Clinical meaning depends on modality and negation.
“Possible PE” is not “PE”. “No chest pain” is not “chest pain”.
Your validator has to treat modality and negation as part of the clinical fact itself. They are not formatting details. They are the fact.
This sounds basic until you look at real output. Models collapse “possible,” “probable,” and “confirmed” all the time if you let them. They also flatten negatives into positives under summarization pressure.
That is not a small formatting error. That is the model changing the clinical reality of the note. If a system cannot preserve those distinctions, it does not belong anywhere near autonomous note generation.
That sounds harsh. It should. Clinical language is full of small distinctions that carry large consequences. If the system cannot carry those distinctions reliably, the system is not ready.
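One way to make that concrete: treat the (negation, modality) pair as part of the fact's signature and require the note's signature to equal the source's. The cue lists below are toy examples, nowhere near a real clinical negation system such as a NegEx-style detector, but they show the shape of the check.

```python
# Modality/negation check: the generated mention must carry the same
# (negated, modality) signature as the source mention for the same concept.
# Cue lists are illustrative placeholders only.

NEGATION_CUES = ("no ", "denies ", "negative for ")
UNCERTAIN_CUES = ("possible ", "probable ", "suspected ", "cannot rule out ")

def fact_signature(mention: str):
    m = mention.lower() + " "
    negated = any(cue in m for cue in NEGATION_CUES)
    modality = "uncertain" if any(cue in m for cue in UNCERTAIN_CUES) else "asserted"
    return (negated, modality)

def modality_preserved(source_mention: str, note_mention: str) -> bool:
    return fact_signature(source_mention) == fact_signature(note_mention)

# modality_preserved("possible PE", "PE")           -> False: uncertainty dropped
# modality_preserved("no chest pain", "chest pain") -> False: negation flattened
```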
6) Build Cognitive Gates for Absence Handling
Absence is its own decision point.
If evidence is missing, the model should not compensate with plausible language. It should do one of three things:
- Leave section content empty,
- Mark uncertainty clearly, or
- Escalate for human confirmation.
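The three options above can be wired into a small routing function. This sketch assumes each section request carries a risk level; the levels and the `absence_gate` name are hypothetical, but note that "generate something plausible" is not among the branches.

```python
# Absence gate: when evidence is missing, route between leaving the section
# empty, marking uncertainty explicitly, or escalating to a human.
# Free generation is never an option.

def absence_gate(section: str, evidence: list, risk: str) -> dict:
    if evidence:
        return {"action": "write", "content": evidence}
    if risk == "high":
        return {"action": "escalate", "content": None}   # human confirms
    if risk == "medium":
        return {"action": "write", "content": ["[not documented in source]"]}
    return {"action": "leave_empty", "content": None}

# An empty allergies section on a high-risk path escalates rather than
# inventing "No known allergies".
```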
7) Pre-Output Safety Checklist
Before final note assembly, run a deterministic checklist:
- Is every high-risk statement evidence-backed?
- Are modality and negation preserved?
- Are unsupported specifics removed?
- Are confidence and uncertainty communicated correctly?
If it fails, it does not ship.
That rule needs to be real. Not advisory. Not soft. If the checklist is optional, it will be ignored the first time the team gets tired or rushed.
I have become pretty skeptical of “human in the loop” as a magic phrase for this reason. If the upstream system keeps handing the clinician polished unsafe output, the human review step becomes cleanup labor. That is not a serious safety design.
Too often “human in the loop” really means “doctor as error sponge.” I am not interested in that model. The product should reduce unsafe cognitive burden, not repackage it.
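The checklist itself is the easy part to make real: a handful of deterministic booleans over the assembled note and its validator outputs, with a hard block on any failure. The field names below are illustrative; what matters is that the result is a gate, not advice.

```python
# Deterministic pre-output checklist. Any failed check blocks the note;
# there is no "ship with a warning" path.

def preflight(note: dict) -> list:
    checks = {
        "high_risk_evidence_backed": not note["unsupported_high_risk"],
        "modality_negation_preserved": not note["modality_drift"],
        "unsupported_specifics_removed": not note["unsupported_specifics"],
        "uncertainty_communicated": note["uncertainty_marked"],
    }
    return [name for name, passed in checks.items() if not passed]

note = {
    "unsupported_high_risk": [],
    "modality_drift": [("possible PE", "PE")],  # validator caught a collapse
    "unsupported_specifics": [],
    "uncertainty_marked": True,
}
failures = preflight(note)  # non-empty -> the note does not ship
```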
8) Treat Monitoring as Product, Not Ops
I want continuous signal on failure classes, not just user complaints after the damage is already done.
Track:
- Fabrication flags by section
- Unsupported entity rates
- Negation/modality drift
- Human correction patterns
Correction data should go straight back into prompts, validators, and failure analysis.
If you are not learning from physician corrections, then you are not really operating the product. You are just waiting for the next complaint.
Correction data is one of the few places where you get direct contact with the real failure surface. Ignoring it is like having telemetry and refusing to open the dashboard.
It is also where you learn what actually irritates physicians enough to break trust. That matters. The failure modes that look minor to an ML team are sometimes exactly the ones that make a clinician stop relying on the product.
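The metrics listed above reduce to per-section rates over logged failure events. This is a deliberately minimal sketch, assuming each flagged event is recorded as a `(section, failure_class)` pair; real monitoring adds time windows and alerting, but even counters like these give you continuous signal instead of waiting on complaints.

```python
from collections import Counter

# Failure-class monitoring: per-section rates of flagged events, so drift
# in a specific class (e.g. negation flips in the assessment section)
# surfaces as a number, not an anecdote.

def failure_rates(events, notes_per_section):
    counts = Counter((section, cls) for section, cls in events)
    return {
        (section, cls): n / notes_per_section[section]
        for (section, cls), n in counts.items()
    }

events = [
    ("medications", "unsupported_entity"),
    ("medications", "unsupported_entity"),
    ("assessment", "modality_drift"),
]
rates = failure_rates(events, {"medications": 100, "assessment": 100})
```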
Bottom Line
Safer clinical AI comes from systems design, not prompt cleverness.
You need architecture, constraints, validation, and monitoring working together.
In medical documentation, reliability is the product. Everything else is secondary.
The sooner a team accepts that, the faster the rest of the design starts making sense. Until then, they keep optimizing the wrong thing, usually because they are still mistaking polish for safety.