Preventing AI Fabrication in Clinical Documentation

One invented medication is enough. One invented order. One physical exam finding that never happened. That is all it takes for a physician to stop trusting the note.

Fabrication in clinical documentation is not an abstract LLM problem. It is product safety. It is workflow safety. It is trust. And the dangerous version is not always the ugly one. Sometimes the dangerous note reads beautifully. That is the trap.

Fluency Is Cheap

A note can sound excellent and still be unsafe. That sentence should make clinical AI teams uncomfortable.

The polished wrong note is worse than the awkward wrong note because it can slide past a tired reviewer. If the product has trained the clinician to relax, fluent fabrication is even more dangerous.

So I don't get excited when someone says the output "reads really well." Good. Now show me where every high-risk statement came from.

Break the Job Apart

Big monolithic prompts fail in ways that are hard to inspect. I prefer smaller stages with narrow jobs:

Extract source facts
Select candidate section content
Validate high-risk entities
Preserve uncertainty and negation
Assemble the final note
Run pre-output safety checks

That structure does not make the system perfect. It makes failure easier to find. When the pipeline is decomposed, you can stop arguing about whether the note is "better." You can start asking where it failed.

The split I care about is simple. Let the model do the language work where language is useful: extraction candidates, section drafts, and synthesis under constraints. Let deterministic checks do the things that should not be vibes: medication matching, allergy matching, negation checks, modality checks, unsupported entity detection, and required evidence IDs.

If a high-risk statement cannot produce its evidence, it should not reach the final note.
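A minimal sketch of what "asking where it failed" can look like, assuming each stage is a plain function over a shared state dictionary. The names run_pipeline, StageFailure, extract_facts, and validate_entities are hypothetical, and the extraction stage is a hard-coded stand-in for a model call.

```python
from typing import Callable

State = dict
Stage = Callable[[State], State]


class StageFailure(Exception):
    """Raised with the name of the stage that rejected the state."""

    def __init__(self, stage: str, reason: str):
        super().__init__(f"{stage}: {reason}")
        self.stage = stage
        self.reason = reason


def run_pipeline(stages: list[tuple[str, Stage]], state: State) -> tuple[State, list[str]]:
    """Run stages in order and record which ones completed."""
    trace = []
    for name, stage in stages:
        try:
            state = stage(state)
        except ValueError as err:
            # Attach the stage name so the failure is attributable.
            raise StageFailure(name, str(err)) from err
        trace.append(name)
    return state, trace


def extract_facts(state: State) -> State:
    # Stand-in for the language-model stage (assumption: it returns source facts).
    return {**state, "facts": ["lisinopril"]}


def validate_entities(state: State) -> State:
    # Deterministic stage: every drafted entity must appear among the facts.
    unsupported = [e for e in state["draft_entities"] if e not in state["facts"]]
    if unsupported:
        raise ValueError(f"unsupported entities: {unsupported}")
    return state


STAGES = [("extract", extract_facts), ("validate", validate_entities)]
```

When a draft fails, the exception names the stage, so the conversation starts at "validate rejected ceftriaxone" instead of "the note seems off."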

Start With Extraction

Most clinical documentation sections should start with a blunt rule: if it is not in the source context, don't write it.

That sounds obvious. It is still where a lot of systems drift. The model wants to be helpful. Helpful often means filling gaps. In a chart, filling gaps can become fabrication.

The demo-video version of a note is smooth and reassuring. The real test is uglier: where exactly did this sentence come from? If the team can't answer that question, the sentence should not survive.

Here is the kind of failure I mean:

Source:
Patient denies fever. No antibiotics were given. Home medication list includes lisinopril.

Candidate note sentence:
"Patient was febrile at home and received ceftriaxone prior to arrival."

Validator behavior:
fever = unsupported
ceftriaxone = unsupported medication/intervention
final behavior = remove sentence or ask clinician to confirm

That is not fancy. That is the point. The validator does not need to understand the whole encounter. It needs to know that "fever" and "ceftriaxone" are high-risk claims with no evidence behind them.
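The example above can be sketched in a few lines, assuming an upstream extraction stage has already normalized both the source and the candidate sentence into (concept, status) pairs. The Fact class, the status labels, and the mapping of "febrile" to the concept "fever" are assumptions made for illustration, not a real vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    concept: str  # normalized concept name, e.g. "fever"
    status: str   # hypothetical labels: "present", "denied", "given", ...


# What the source actually supports.
source_facts = {
    Fact("fever", "denied"),
    Fact("antibiotics", "not_given"),
    Fact("lisinopril", "home_medication"),
}

# What the candidate sentence claims.
candidate_claims = [
    Fact("fever", "present"),      # "Patient was febrile at home"
    Fact("ceftriaxone", "given"),  # "received ceftriaxone prior to arrival"
]


def unsupported(claims: list[Fact], facts: set[Fact]) -> list[Fact]:
    """Return claims with no exact (concept, status) match in the source."""
    return [claim for claim in claims if claim not in facts]


for claim in unsupported(candidate_claims, source_facts):
    print(f"{claim.concept} = unsupported")
```

Matching on the pair, not just the concept, is the design choice that matters here: "fever" does appear in the source, but only as denied, so a claim of present fever finds nothing to stand on.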

Make Claims Earn Their Place

Before final assembly, candidate statements should map back to evidence. No evidence means one of three things:

Remove it
Flag it
Ask the clinician

That single rule clears out a surprising amount of bad output. It also changes the culture of the team. You stop debating whether an unsupported sentence is "reasonable." You ask whether it is supported.

If it isn't, it is gone. That sounds harsh. Clinical notes need that kind of harshness.
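One way to sketch that rule, assuming every candidate statement carries evidence IDs that must resolve to spans in the source. The Statement class, the Action enum, and the per-category policy table are hypothetical; which of the three behaviors applies to which category is a product decision, not something this code settles.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    REMOVE = "remove"
    FLAG = "flag"
    ASK_CLINICIAN = "ask_clinician"


@dataclass
class Statement:
    text: str
    category: str
    evidence_ids: list[str] = field(default_factory=list)


# Hypothetical policy: what to do with an unsupported statement, by category.
UNSUPPORTED_POLICY = {
    "medication": Action.ASK_CLINICIAN,
    "order": Action.ASK_CLINICIAN,
    "history": Action.FLAG,
}


def route(stmt: Statement, source_spans: dict[str, str]) -> Action:
    """Keep a statement only if it cites evidence and every ID resolves."""
    all_resolve = all(eid in source_spans for eid in stmt.evidence_ids)
    if stmt.evidence_ids and all_resolve:
        return Action.KEEP
    # Missing or dangling evidence: default to removal when no policy is set.
    return UNSUPPORTED_POLICY.get(stmt.category, Action.REMOVE)
```

Note that a dangling ID is treated the same as no ID. An evidence pointer that does not resolve is not evidence.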

Treat High-Risk Terms Differently

Some terms deserve almost no tolerance for approximation:

Medications
Allergies
Procedures
Critical diagnoses
Orders

Near-match is not good enough for these classes. Semantic similarity is useful in plenty of domains. In clinical documentation, it can create a false diagnosis, a false action, or a false legal story.

"Pretty close" is not a safety standard.
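A small illustration of why, using the standard library's difflib. Hydroxyzine and hydralazine are a well-known look-alike pair and entirely different drugs, yet a string-similarity score treats them as close. The helper names below are hypothetical, and a real system would more likely match on coded identifiers than on raw strings; this sketch only shows the gap between fuzzy and exact matching.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; higher means more alike."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(term: str, source_terms: list[str]) -> bool:
    """Case- and whitespace-insensitive exact membership, nothing fuzzier."""
    normalized = {t.strip().casefold() for t in source_terms}
    return term.strip().casefold() in normalized


source_meds = ["hydroxyzine"]

# The fuzzy score is high even though these are different drugs.
print(similarity("hydralazine", "hydroxyzine"))

# The exact check refuses the near-match.
print(exact_match("hydralazine", source_meds))  # False
```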

Preserve Negation and Modality

"Possible PE" is not "PE." "No chest pain" is not "chest pain." "The patient denies fever" is not "fever." Those are not wording details. They are the clinical fact.

Models routinely collapse "possible," "probable," "confirmed," "denied," and "absent" into one another if the system lets them. That is not a formatting issue. That is the note changing reality. If a documentation system can't preserve modality and negation, it should not be trusted with autonomous chart text.
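A deliberately crude sketch of a drift check, assuming the cue word sits immediately before the term. The cue lists and status labels are assumptions, and real clinical negation detection needs far more than this (scope, post-term cues, pseudo-negations). The point is only that the comparison between source and note can be deterministic.

```python
import re

NEGATION_CUES = {"no", "denies", "without"}
UNCERTAINTY_CUES = {"possible", "probable", "suspected"}


def assertion_status(sentence: str, term: str) -> str:
    """Classify a term as absent, negated, uncertain, or asserted."""
    text = sentence.lower()
    match = re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    if match is None:
        return "absent"
    # Only the single word directly before the term is inspected.
    preceding = text[: match.start()].split()
    cue = preceding[-1] if preceding else ""
    if cue in NEGATION_CUES:
        return "negated"
    if cue in UNCERTAINTY_CUES:
        return "uncertain"
    return "asserted"


def status_drifted(source_sentence: str, note_sentence: str, term: str) -> bool:
    """True when the note's status for a term differs from the source's."""
    return assertion_status(source_sentence, term) != assertion_status(note_sentence, term)


print(assertion_status("Possible PE", "PE"))               # uncertain
print(assertion_status("No chest pain", "chest pain"))     # negated
print(status_drifted("The patient denies fever", "Patient has fever", "fever"))  # True
```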

Absence Is a Decision

Missing evidence should not trigger plausible prose. If the source does not support a section, the system needs a disciplined behavior:

Leave it empty
Mark uncertainty
Escalate for confirmation

Do not let the model pad the record because the output looked too sparse. Sparse and honest beats complete and fake.
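A minimal sketch of that discipline, assuming the section builder receives only statements that already passed evidence checks. SectionResult, build_section, and the required flag are hypothetical names; the choice to escalate only required sections is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SectionResult:
    name: str
    text: str
    needs_confirmation: bool


def build_section(name: str, supported: list[str], required: bool = False) -> SectionResult:
    """Assemble a section from supported statements, or leave it empty."""
    if supported:
        return SectionResult(name, " ".join(supported), needs_confirmation=False)
    # No evidence: emit nothing. Required sections are escalated, not padded.
    return SectionResult(name, "", needs_confirmation=required)
```

There is no branch that calls a model to fill the gap. An empty section is a valid output, and the only question left is whether someone needs to be asked.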

The Checklist Has to Be Real

Before final note assembly, I want deterministic checks:

Are high-risk statements evidence-backed?
Are negation and modality preserved?
Are unsupported specifics removed?
Is uncertainty visible where it belongs?
Did any section invent an entity?

If a note fails any of these checks, it does not ship. Not "send to the clinician and hope they catch it." That is not safety. That is outsourcing cleanup.
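A sketch of such a gate, assuming the note arrives as a dictionary of claims that each carry a risk flag, evidence IDs, and the status recorded in both source and note. The field names and the two checks shown are hypothetical; a real checklist would carry more.

```python
from typing import Callable

Check = Callable[[dict], list[str]]


def check_evidence(note: dict) -> list[str]:
    """High-risk claims must cite at least one evidence ID."""
    return [
        f"no evidence: {c['text']}"
        for c in note["claims"]
        if c["high_risk"] and not c["evidence_ids"]
    ]


def check_status_preserved(note: dict) -> list[str]:
    """Negation and modality must match between source and note."""
    return [
        f"status changed: {c['text']}"
        for c in note["claims"]
        if c["note_status"] != c["source_status"]
    ]


CHECKS: list[Check] = [check_evidence, check_status_preserved]


def release_gate(note: dict) -> tuple[bool, list[str]]:
    """Return (ok, failures); the note ships only when failures is empty."""
    failures = [msg for check in CHECKS for msg in check(note)]
    return (not failures, failures)
```

The gate returns the failures, not just a boolean, so a blocked note feeds the same data you need for monitoring.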

"Human in the loop" can become a lazy phrase. If the upstream system keeps handing physicians polished unsafe output, the human becomes an error sponge. I am not interested in that model. The product should reduce cognitive burden, not repackage it.

Monitoring Is Product Work

You need continuous signal on the real failure surface:

Fabrication flags by section
Unsupported entity rates
Negation and modality drift
Physician correction patterns
Recurring complaint classes

Physician corrections are not noise. They are product data. They show where the system broke trust, where the validators missed, and where the workflow made review harder.

If you are not learning from corrections, you are not operating the product. You are waiting for the next complaint.
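A rough sketch of turning those signals into counts, assuming the product already logs events as dictionaries with a type field. The event types, field names, and category labels are hypothetical.

```python
from collections import Counter


def summarize(events: list[dict]) -> tuple[Counter, Counter]:
    """Count fabrication flags by section and corrections by category."""
    flags_by_section = Counter(
        e["section"] for e in events if e["type"] == "fabrication_flag"
    )
    corrections_by_category = Counter(
        e["category"] for e in events if e["type"] == "physician_correction"
    )
    return flags_by_section, corrections_by_category


def unsupported_entity_rate(unsupported: int, total: int) -> float:
    """Share of drafted entities with no source support; 0.0 when none drafted."""
    return unsupported / total if total else 0.0


events = [
    {"type": "fabrication_flag", "section": "medications"},
    {"type": "fabrication_flag", "section": "medications"},
    {"type": "fabrication_flag", "section": "exam"},
    {"type": "physician_correction", "category": "negation_flipped"},
]
flags, corrections = summarize(events)
print(flags.most_common(1))  # [('medications', 2)]
```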

Reliability Is the Product

Safer clinical documentation AI comes from system design: architecture, constraints, validators, monitoring, rollback, and a culture that treats unsupported facts as defects, not rough edges.

Prompt cleverness is not enough. In medical documentation, reliability is the product. Everything else comes after.