Andrew Napier

Changing Models in Clinical Documentation Without Losing Control

A new model is not a reason to touch the chart. Failure modes decide whether migration is worth it.

clinical-ai, llm-safety, model-migration

A new model is not a reason to touch the chart. That is the starting point for me.

If the current model is failing in ways I can't accept, I look for a replacement. If it is stable and the failure modes are understood, I leave it alone.

That sounds conservative because it is. Clinical documentation is a bad place to satisfy technical curiosity.

The Chart Is Not a Sandbox

The failures that matter in high-acuity documentation are usually not mysterious. The system invents a medication order. It adds an intervention that never happened. It turns "possible" into "confirmed." It compresses "no chest pain" into "chest pain." It makes the plan sound cleaner than the actual clinical reasoning was.

One fabricated sentence is enough to break trust. That is the part people outside healthcare often miss.

In normal software, a bad output can be annoying. In the medical record, a bad output can become part of the story everyone else reads later. It can shape downstream care. It can create legal exposure. It can make the physician stop trusting every other sentence in the note. Once that trust breaks, the model does not get a lot of second chances.

Prompting Is Not the Whole Game

Good prompts help. Extraction-first instructions help. They are not enough.

Long, messy clinical context is where models start taking liberties. The note is full of uncertainty, shorthand, implied reasoning, side comments, and things a physician knows to leave qualified.

If the architecture lets the model jump from messy source material to polished chart prose in one move, you lose visibility. You might get a note that reads well. You just don't know how many liberties it took to get there.

That is why I treat model migration as a systems problem, not a prompt rewrite. The system has to force every claim to earn its place.

How I Would Migrate a Model

I want the evaluation to look like the actual workload. Not cherry-picked examples. Not a handful of prompts that make the new model look good.

Representative note types:

  • HPI-heavy encounters
  • Critical care
  • Advance care planning
  • Complex differentials
  • Disposition-heavy notes
  • Procedure documentation

Then score what actually matters:

  • Fabrication
  • Omission
  • Uncertainty handling
  • Negation handling
  • High-risk entity drift (medications, orders, procedures, diagnoses)
  • Style drift that creates clinical ambiguity

The question is not "Is the new model better?" The question is narrower: is it safer for this stage of this pipeline under these constraints? That is a much better question.
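
One way to keep that scoring honest is to tally reviewer findings by note type and failure class instead of producing a single grade. The following is a minimal sketch, not a description of any particular system: the `Finding` record, the class labels, and the per-100-notes rate are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Failure classes taken from the list above; the short labels are hypothetical.
FAILURE_CLASSES = (
    "fabrication",
    "omission",
    "uncertainty",
    "negation",
    "entity_drift",
    "style_ambiguity",
)


@dataclass(frozen=True)
class Finding:
    """One reviewer-flagged problem in one generated note (hypothetical record)."""
    note_type: str      # e.g. "critical_care"
    failure_class: str  # one of FAILURE_CLASSES


def tally(findings):
    """Count findings per (note_type, failure_class) pair."""
    counts = Counter()
    for f in findings:
        if f.failure_class not in FAILURE_CLASSES:
            raise ValueError(f"unknown failure class: {f.failure_class}")
        counts[(f.note_type, f.failure_class)] += 1
    return counts


def rate_per_100(counts, notes_reviewed):
    """Convert raw counts to events per 100 reviewed notes of that note type."""
    return {
        key: 100.0 * n / notes_reviewed[key[0]]
        for key, n in counts.items()
    }
```

Keeping the note type in the key is the point. A model can look fine on HPI-heavy encounters and still mishandle negation in critical care, and a single pooled number would not show it.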

Migrate by Stage, Not by Vibes

A modular documentation pipeline changes the migration conversation. You can test and promote one stage at a time:

  • HPI extraction
  • Physical exam selection
  • Physical exam synthesis
  • Critical care documentation
  • ACP documentation
  • Diagnostic synthesis
  • Assessment and plan
  • Final chart assembly

If the new model is better at HPI extraction and worse at assessment text, that is useful. Use it where it helps. Don't promote it everywhere just to make the migration feel cleaner.
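
Stage-level promotion can be as plain as a routing table that maps each stage to the model serving it. This is a sketch under assumptions: the stage keys mirror the list above, and the two model identifiers are placeholders, not real model names.

```python
# Placeholder identifiers; substitute whatever your deployment actually uses.
CURRENT_MODEL = "model_current"
CANDIDATE_MODEL = "model_candidate"

# Stage names follow the pipeline stages listed above.
STAGES = (
    "hpi_extraction",
    "physical_exam_selection",
    "physical_exam_synthesis",
    "critical_care_documentation",
    "acp_documentation",
    "diagnostic_synthesis",
    "assessment_and_plan",
    "final_chart_assembly",
)


def build_routing(promoted_stages):
    """Return a stage -> model map; only explicitly promoted stages change."""
    unknown = set(promoted_stages) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    return {
        stage: CANDIDATE_MODEL if stage in promoted_stages else CURRENT_MODEL
        for stage in STAGES
    }


# Promote the candidate for HPI extraction only; everything else stays put.
routing = build_routing({"hpi_extraction"})
```

The default matters here. A stage stays on the current model unless someone names it, so a migration cannot spread to a section nobody evaluated.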

"Better overall" is often a way to hide section-specific failures. Clinical notes don't fail overall. They fail in a line, a section, a diagnosis, an order, a missing qualifier. That is where the review has to live.

A real migration decision should be that specific. Take a candidate model that improves HPI extraction. It catches the timeline better, keeps the patient's actual words closer to the source, and misses fewer late-history corrections.

Good. Now put the same model into assessment and plan generation.

If it turns "possible early pneumonia" into "pneumonia," adds ceftriaxone when the chart only says it was considered, or drops the reason the physician chose observation instead of discharge, it does not move forward for that stage.

The decision is not complicated:

  • Promote it for HPI extraction if it beats the current model on source fidelity
  • Keep the old model for assessment and plan if the candidate adds false certainty
  • Set the rollback trigger before launch, not after the first angry physician complaint
  • Recheck the section after real production complaints, because offline review never catches everything

That is what controlled migration looks like. No victory lap. Just a smaller blast radius.

What Counts as Better

I don't care if the new output sounds more impressive. I care if the failure surface improves.

Real improvement looks like this:

  • Fewer fabrication events in high-risk sections
  • Stronger adherence to extraction-only rules
  • Better preservation of uncertainty language
  • Less false specificity in plans and diagnostic reasoning
  • Fewer unsupported medications, orders, procedures, and diagnoses
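
Checking whether the failure surface improved means comparing section by section, because a lower total can hide a section that got worse. A small sketch of that comparison follows; the section names and counts are invented for illustration and are not results from any evaluation.

```python
def regressions(current, candidate):
    """Return sections where the candidate has more failure events than current."""
    return sorted(
        section
        for section, n in current.items()
        if candidate.get(section, 0) > n
    )


# Illustrative fabrication counts per section (made-up numbers).
current = {"hpi": 12, "physical_exam": 5, "assessment_and_plan": 3}
candidate = {"hpi": 4, "physical_exam": 4, "assessment_and_plan": 6}

total_current = sum(current.values())      # 20
total_candidate = sum(candidate.values())  # 14
regressed = regressions(current, candidate)
```

Here the candidate wins on the total, 14 events against 20, and `regressed` is still `["assessment_and_plan"]`. The total says migrate. The section says not there.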

The model might deserve some credit. The system deserves more.

Most safe gains come from model behavior plus guardrails: evidence grounding, validators, constraints, and monitoring. If you credit the model for everything, you will keep chasing the next model. If you credit the system, you will keep improving the product.

Guardrails I Do Not Negotiate With

For production clinical documentation, I treat these as table stakes:

  • Evidence grounding before final assembly
  • Strict validation for high-risk entities
  • Preservation of modality: possible, probable, confirmed
  • Preservation of negation
  • Absence-handling gates, so information missing from the encounter stays missing
  • Section-level rollback
  • Post-deployment monitoring by failure class

If a sentence cannot be tied back to encounter evidence, it should not ship. I don't care how fluent it sounds. Fluency is exactly what makes unsupported clinical text dangerous.
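
As a rough sketch of what such a gate could look like, the check below requires each output sentence to carry a verbatim evidence quote from the source, then compares hedge and negation words between the quote and the sentence. Everything here is an assumption for illustration: the word lists are tiny, the matching is naive keyword lookup, and a production validator would need far more than this.

```python
import re

# Deliberately small, hypothetical word lists; real coverage would be much wider.
HEDGES = ("possible", "probable", "suspected", "likely")
NEGATIONS = ("no", "not", "denies", "without")


def _words(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))


def check_sentence(sentence, evidence, source):
    """Return a list of problems; an empty list means the sentence may proceed."""
    # Grounding: the cited evidence must appear verbatim in the source.
    if evidence.lower() not in source.lower():
        return ["evidence_not_in_source"]
    problems = []
    ev, out = _words(evidence), _words(sentence)
    # Modality: a hedge in the evidence must survive into the sentence.
    if any(h in ev for h in HEDGES) and not any(h in out for h in HEDGES):
        problems.append("hedge_dropped")
    # Negation: presence of a negation word must match on both sides.
    if any(n in ev for n in NEGATIONS) != any(n in out for n in NEGATIONS):
        problems.append("negation_changed")
    return problems


source = "Patient denies fever. No chest pain. Possible early pneumonia on film."
```

With that source, `check_sentence("Pneumonia.", "Possible early pneumonia", source)` reports `["hedge_dropped"]`, and `check_sentence("Chest pain.", "No chest pain", source)` reports `["negation_changed"]`. A gate this crude will raise false alarms, which is the safer way for it to be wrong.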

The Cost of Doing It Right

No migration is free. You will rewrite prompts, retune token budgets, revalidate formatting assumptions, watch throughput, re-score sections you thought were settled, and argue about whether a prettier note is actually safer.

Good. That friction is the price of touching the chart.

If you are building a demo, convenience can win. If you are writing into the medical record, convenience should lose almost every time. The safest architecture wins. Not the newest model. Not the prettiest output. The one that limits damage when the model is wrong.