Andrew Napier

Building a Chest Pain Risk Stratifier from Clinical Conversations

Why I built chest pain risk support from conversation data, and why support matters more than false certainty.

#machine-learning #fastapi #clinical-informatics #emergency-medicine

Chest pain is one of the easiest complaints to get fooled by in the ED.

The dangerous cases do not always look dramatic at the start. That is the problem.

I worked on this prototype because I wanted something that could support the differential without pretending it was making the diagnosis.

That boundary mattered to me from the beginning. I do not think most clinical AI gets into trouble because people are trying to be reckless. I think it gets into trouble because teams start using language like “assist” or “augment” while quietly designing for replacement.

That distinction is not semantic. It changes everything about the system design. Once you decide the product is support rather than replacement, you stop chasing false certainty and start thinking much harder about framing, ranking, and uncertainty.

Problem Framing

The goal was not autonomous diagnosis.

I wanted to turn messy conversation data into something useful for differential framing and workup planning.

That matters because a lot of chest pain thinking happens before the chart is clean. It happens in the first exchange. It happens in the way the patient describes timing, exertion, radiation, pleuritic features, prior episodes, and what they are worried about. If you wait for a perfectly normalized record, you are already late.

That is also why I cared about conversation data in the first place. The signal is often present before the structured chart catches up. You can lose useful nuance if you only look after everything has been compressed into boxes.

The patient interview is not just pre-processing for the chart. It is where a lot of the real diagnostic texture lives. How the pain started, what the patient means when they say pressure versus sharpness, what makes them worried, what they leave out until the third question, all of that matters.

The core differential set included:

  • Acute coronary syndrome (ACS)
  • Pulmonary embolism (PE)
  • Aortic dissection
  • Pericarditis
  • Musculoskeletal pain
  • Anxiety-related presentations
  • Additional chest pain etiologies in a multi-label setup

System Design

I kept the stack simple on purpose.

When people talk about clinical AI prototypes, they often want to jump straight to the fanciest model they can run. I did not think that was the hard part here. The harder part was deciding what the system was actually allowed to claim and how uncertainty should appear in the output.

That is where I think a lot of healthcare AI work still feels under-disciplined. People spend their ambition on the model and not enough on the claim boundary. In clinical settings, the claim boundary matters more.

That is a recurring theme for me in clinical AI. The hard part is usually not making the system say something impressive. The hard part is making it say only what it can defend.

High-level flow:

  1. Ingest structured and semi-structured encounter features
  2. Build feature vectors from TF-IDF text + binary clinical indicators
  3. Run multi-label prediction pipeline
  4. Return risk-oriented outputs with confidence framing
  5. Generate workup guidance for clinician review
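The flow above can be sketched as a thin orchestration layer. Everything here is illustrative, not the actual codebase: `Encounter`, `RiskOutput`, and the `predict` callable are hypothetical names standing in for the real ingest, feature, and model steps.

```python
from dataclasses import dataclass, field

@dataclass
class Encounter:
    narrative: str                                 # conversation/documentation text
    findings: dict = field(default_factory=dict)   # binary clinical indicators

@dataclass
class RiskOutput:
    scores: dict      # label -> probability
    intervals: dict   # label -> (low, high) confidence interval
    workup: list      # suggested workup categories for clinician review

def stratify(encounter: Encounter, predict) -> RiskOutput:
    """Orchestrate: features + prediction (steps 2-4), then framed output (step 5)."""
    scores, intervals = predict(encounter)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Workup suggestions are keyed off the top-ranked differentials,
    # and are framed as items for review, never as orders.
    workup = [f"review workup for {label}" for label in ranked[:3]]
    return RiskOutput(scores=scores, intervals=intervals, workup=workup)
```

The point of the shape is that the model is just one swappable callable inside a pipeline whose output is explicitly framed for review.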

Feature Strategy

I used a hybrid feature model:

  • TF-IDF for narrative signal from conversation/documentation text
  • Binary features for key clinical findings and risk factors

Text alone was too loose. Pure structured features were too thin. The hybrid setup worked better because it kept the narrative signal without letting the whole thing collapse into text noise.
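The hybrid idea is just concatenation: a weighted term vector from the narrative, plus a binary vector from structured findings. A dependency-free sketch of that shape, with a toy TF-IDF weighting (real code would use a library vectorizer; the vocabulary, document frequencies, and indicator keys here are made up):

```python
import math
from collections import Counter

def tfidf_vector(text, vocab, doc_freq, n_docs):
    """Toy TF-IDF: term frequency times a smoothed inverse document frequency."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    vec = []
    for term in vocab:
        tf = counts[term] / total
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1
        vec.append(tf * idf)
    return vec

def hybrid_features(text, indicators, vocab, doc_freq, n_docs, indicator_keys):
    """Narrative signal first, binary clinical indicators appended after."""
    text_part = tfidf_vector(text, vocab, doc_freq, n_docs)
    binary_part = [1.0 if indicators.get(k) else 0.0 for k in indicator_keys]
    return text_part + binary_part
```

The fixed `indicator_keys` ordering matters: the structured half of the vector has to mean the same thing for every encounter, even when the narrative half is noisy.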

That also matched the way physicians actually think. We use structured findings, but we also use the story. The story changes what the same vital signs or risk factors mean.

A young patient with sharp pain after a coughing fit and a normal story around it does not feel the same as the patient with exertional pressure, risk factors, and a history that keeps getting worse as you ask more questions. The point is obvious to a physician. The model still has to be taught how to respect that difference.

And even then, I do not want the model pretending it has physician intuition. I want it surfacing signal in a way that helps the clinician orient faster. That is a much more defensible goal.

Modeling and Evaluation

I cared a lot about interpretability and calibration because a confident wrong answer is worse than a modest useful one.

Approach included:

  • Logistic-style baseline models for transparent behavior
  • Multi-label output handling for overlapping differentials
  • Bootstrap confidence intervals to reduce overinterpretation of point estimates
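The multi-label, logistic-style baseline reduces to something like one independent sigmoid scorer per label, a one-vs-rest framing where differentials are not forced to compete for a single slot. The weights below are invented for illustration; the real models were fit on encounter data.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score_labels(features, weights, biases):
    """One independent logistic score per differential label.

    Probabilities deliberately do not sum to 1: a presentation can
    plausibly carry elevated risk for more than one etiology.
    """
    return {
        label: sigmoid(sum(w * x for w, x in zip(ws, features)) + biases[label])
        for label, ws in weights.items()
    }
```

Because each label has its own linear model, a clinician can ask "which features pushed ACS up?" and get a concrete answer, which is most of why I kept the baseline this transparent.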

In chest pain work, confidence communication matters almost as much as ranking. If a model sounds certain when it should not, it becomes dangerous fast.

That is one of the places where a lot of clinical ML work still feels unserious to me. The question is not whether you can rank a few labels. The question is whether the system knows when it does not know enough.

That is where calibration becomes more than a statistics word. It becomes a design constraint. If the system cannot express uncertainty honestly, then the interface around it has to compensate or the product becomes misleading.
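The bootstrap intervals mentioned above are conceptually simple: resample with replacement, recompute the statistic, and report an empirical percentile interval instead of a bare point estimate. A minimal sketch (the default statistic here is a mean purely for illustration):

```python
import random

def bootstrap_ci(values, stat=lambda xs: sum(xs) / len(xs),
                 n_boot=1000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) percentile interval for `stat` over `values`."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample the observations with replacement, same sample size.
        resample = [rng.choice(values) for _ in values]
        stats.append(stat(resample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

The width of the interval is the useful part: a wide interval under sparse data is exactly the "I do not know enough" signal the interface should surface rather than hide.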

I think that is one reason so many clinical demos feel wrong to physicians even when the outputs look polished. The system sounds more certain than the underlying evidence deserves. That tone mismatch is not cosmetic. It changes how the user interprets the tool.

API Layer

FastAPI gave me clean operational boundaries:

  • Input schema validation
  • Predict endpoint for differential scoring
  • Output schema with confidence metadata
  • Logging hooks for post-hoc evaluation and model drift review
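The actual prototype used FastAPI with schema models for this; below is a dependency-free sketch of the same contract, so the shape is visible without the framework. All field names (`narrative`, `findings`, `differentials`) are illustrative, and `score_fn` stands in for the real model call.

```python
import json
import logging

logger = logging.getLogger("stratifier")

def validate_request(payload: dict) -> dict:
    """Input schema validation: reject malformed requests before scoring."""
    if not isinstance(payload.get("narrative"), str) or not payload["narrative"].strip():
        raise ValueError("narrative must be a non-empty string")
    if not isinstance(payload.get("findings"), dict):
        raise ValueError("findings must be an object of binary indicators")
    return payload

def predict_endpoint(payload: dict, score_fn) -> dict:
    """Predict handler: validate, score, attach confidence metadata, log."""
    req = validate_request(payload)
    scores, intervals = score_fn(req["narrative"], req["findings"])
    response = {
        "differentials": scores,             # label -> probability
        "confidence_intervals": intervals,   # label -> [low, high]
        "framing": "decision support only; not a diagnosis",
    }
    # Logging hook: every response is recorded for post-hoc
    # evaluation and model drift review.
    logger.info("prediction %s", json.dumps(response))
    return response
```

Putting the "support, not diagnosis" framing into the response schema itself, rather than leaving it to the UI, was a deliberate choice: any consumer of the API inherits the claim boundary.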

That also made it easier to test with clinicians and inspect failures without pretending the model was more stable than it was.

I wanted something that could be interrogated. If the output changed, I wanted to know why. If a physician disagreed with the system, I wanted the disagreement to be concrete instead of philosophical.

That kind of inspectability is not glamorous, but it is what turns a prototype into something you can actually discuss with other clinicians. Otherwise the whole conversation becomes vague very quickly.

And vague is exactly what I do not want in a high-stakes workflow. If the model is wrong, I want to be able to say how, not just that something felt off.

Clinical UX Principle

The UI should support clinical reasoning, not replace it.

Output was designed as:

  • Differential likelihood support
  • Structured rationale cues
  • Suggested workup categories for clinician confirmation

That distinction matters. If the product starts acting like it can replace judgment, the design is already drifting in the wrong direction.

I do not think physicians need another black box that gives them a number and expects obedience. They need systems that can surface the right competing possibilities, point to the relevant signal, and get out of the way.

That is the type of support I think is worth building. Not a machine that performs confidence. A machine that helps the clinician orient faster.

That is also a better standard for product humility. If the tool cannot explain why it is nudging the differential in a certain direction, it should probably be quieter.

Key Lessons

  1. Multi-label framing matches real ED reasoning better than forced single-label outputs.
  2. Hybrid features are a practical bridge between narrative data and clinically stable signals.
  3. Confidence intervals help prevent false certainty in ambiguous presentations.
  4. Clean APIs matter early if there is any chance the work will move beyond a toy demo.

Another lesson is that clinical usefulness and model sophistication are not the same thing. A simpler model with tighter framing is often more useful than a fancier system that cannot explain itself and cannot communicate uncertainty.

That is especially true in a workflow like chest pain where overconfidence is expensive. If the system sounds too certain at the wrong moment, it can distort the whole encounter.

Next Steps

If I take this further, the next steps are:

  • Better uncertainty calibration under sparse data conditions
  • Tighter integration with documentation pipelines
  • Human-in-the-loop feedback loops for continuous refinement
  • Expanded evaluation aligned with pragmatic clinical AI frameworks

I am not interested in one more flashy clinical AI prototype. I am interested in support that helps a physician think more clearly when the patient in front of them could still be very sick.

That is still the standard I would use if I took this further. Not whether the system feels impressive in a demo. Whether it helps a clinician think better before certainty is available. That is a much harder bar, and a much more useful one.