Andrew Napier

7 min read

Building a Chest Pain Risk Stratifier from Clinical Conversations

Chest pain exposes overconfident clinical AI. Useful support beats false diagnosis.

machine-learning, fastapi, clinical-informatics, emergency-medicine

A chest pain patient can look fine right up until the case turns. They walk in and talk normally. "Pressure" becomes "indigestion." Then the exertional part shows up after one more question.

The first EKG buys you time. It doesn't buy you certainty. Chest pain was the right stress test for this prototype because it punishes confident shortcuts.

I wasn't interested in building a fake doctor. I wanted the thing that actually helps in the room: keep the differential visible, surface the signal, and make uncertainty hard to ignore.

The Line I Wouldn't Cross

I was not building autonomous diagnosis. I was building differential framing and workup support from messy encounter data. Draw that line early and the build changes.

If the tool is trying to replace the physician, it starts chasing certainty it hasn't earned. If the tool is trying to support the physician, it has to rank, frame, preserve uncertainty, and explain. Then it has to stop there.

Those are different systems. A lot of clinical AI gets into trouble because teams say "assist" while designing for replacement. The risk is safety, not semantics.

Why the Interview

Chest pain thinking starts before the chart is clean. It starts in the first exchange.

The signal is not one thing. It is timing, exertion, radiation, pleuritic features, prior episodes, what the patient means by "pressure," and the thing they forgot to mention until you asked again. Useful signal often exists before it becomes structured data.

Clinical conversations were the point. The interview is raw material for the note, but it is also where much of the diagnostic texture lives. If you only look after the encounter has been compressed into boxes, you can lose the part that made it make sense.

The Differential

I used a multi-label frame because ED reasoning rarely works like a single-answer quiz. Core labels included:

  • Acute coronary syndrome
  • Pulmonary embolism
  • Aortic dissection
  • Pericarditis
  • Musculoskeletal pain
  • Anxiety-related presentations
  • Other chest pain etiologies

I was not trying to make the model pick one answer. I wanted a system that organized risk while the clinician still owned the reasoning.

Prototype Boundary

The scope of the public version matters here. This was a prototype pattern, not a deployed diagnostic product. I am not publishing patient transcripts or PHI, and I am not claiming that this should drive care independently.

The public demo cases were synthetic. The reviewer pass was an offline phase 1 check, not prospective validation. Five cases and three physicians can expose failure modes. They cannot prove clinical safety.

The input shape was encounter-level signal: transcript-derived text, structured facts, and binary clinical indicators. The output shape was risk-oriented support for clinician review: ranked possibilities, cues, and uncertainty.

That boundary kept it honest. The useful artifact looked less like a diagnosis and more like a case handoff.

{
  "differential_support": [
    {
      "label": "pulmonary embolism",
      "signal": ["pleuritic pain", "dyspnea mentioned later"],
      "uncertainty": "incomplete history"
    }
  ],
  "conflicts": ["early note says no shortness of breath; later history says dyspnea"],
  "missing_negatives": ["leg swelling", "hemoptysis"],
  "review_prompt": "Confirm dyspnea timeline before interpreting PE risk."
}

Good.

If the story is incomplete, the tool should not fill in the missing negative. If the chart says "no shortness of breath" in one place and later mentions pleuritic dyspnea, the correct behavior is not to quietly pick the exciting label. It should surface the conflict, lower confidence, and make review obvious.
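
A minimal sketch of that behavior, assuming a hypothetical `surface_conflicts` helper and made-up input shapes (`structured`, `transcript_mentions`, `expected`). It is not the prototype's code. It only illustrates the three-state idea: a finding can be reported, denied, or never asked about, and the last state must not collapse into "absent."

```python
from typing import Optional

def surface_conflicts(
    structured: dict[str, Optional[bool]],   # True / False / None (never charted)
    transcript_mentions: set[str],           # findings reported anywhere in the transcript
    expected: list[str],                     # findings a reviewer would want addressed
) -> dict[str, list[str]]:
    conflicts, missing = [], []
    for finding in expected:
        charted = structured.get(finding)    # None means never documented
        mentioned = finding in transcript_mentions
        if charted is False and mentioned:
            # Chart denies it, transcript reports it: keep both, pick neither.
            conflicts.append(f"chart says no {finding}; transcript mentions {finding}")
        elif charted is None and not mentioned:
            # Not asked is not the same as absent.
            missing.append(finding)
    return {"conflicts": conflicts, "missing_negatives": missing}


result = surface_conflicts(
    structured={"dyspnea": False, "pleuritic pain": True},
    transcript_mentions={"dyspnea", "pleuritic pain"},
    expected=["dyspnea", "pleuritic pain", "leg swelling", "hemoptysis"],
)
print(result)
```

The function never turns an unasked finding into a negative. It only reports what is contradictory and what is still unknown.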

The artifact I kept coming back to was not an accuracy slide. It was the failure table. One row looked like this in plain English: patient reports "pressure" after exertion, then later says it was sharp and worse with inspiration. The structured field says no dyspnea. The transcript mentions shortness of breath only after the clinician asks a second time.

The pass condition was not "predict PE." The pass condition was narrower.

  • Do not erase the conflict.
  • Do not treat the missing negative as confirmed absent.
  • Keep ACS, PE, and other serious etiologies visible.
  • Tell the clinician what fact changed the ranking.
  • Leave the final call in the clinician's hands.
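
The pass condition above can be written as a fixture check instead of an accuracy metric. This is a sketch under assumptions: `check_fixture` and the output keys (`ranked`, `conflicts`, `assumed_absent`, `ranking_change_reason`, `final_diagnosis`) are hypothetical names, and treating aortic dissection as one of the "other serious etiologies" is my illustration, not the prototype's schema.

```python
# Assumed set of diagnoses that must stay visible in the ranked output.
SERIOUS = {"acute coronary syndrome", "pulmonary embolism", "aortic dissection"}

def check_fixture(output: dict, expected_conflict: str, unasked: set[str]) -> list[str]:
    """Return the violated pass conditions; an empty list means the fixture passes."""
    failures = []
    if expected_conflict not in output.get("conflicts", []):
        failures.append("conflict erased")
    if unasked & set(output.get("assumed_absent", [])):
        failures.append("missing negative treated as absent")
    if not SERIOUS <= set(output.get("ranked", [])):
        failures.append("serious etiology dropped from view")
    if not output.get("ranking_change_reason"):
        failures.append("no explanation for ranking change")
    if output.get("final_diagnosis") is not None:
        failures.append("tool made the final call")
    return failures


conflict = "structured field says no dyspnea; transcript reports dyspnea"
output = {
    "ranked": ["pulmonary embolism", "acute coronary syndrome",
               "aortic dissection", "musculoskeletal pain"],
    "conflicts": [conflict],
    "assumed_absent": [],
    "ranking_change_reason": "late dyspnea mention raised PE",
    "final_diagnosis": None,
}
failures = check_fixture(output, conflict, {"leg swelling", "hemoptysis"})
print(failures)
```

None of the five checks asks whether the top label is correct. That is deliberate in this sketch.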

A fixture like that is less impressive than a polished demo, but the failure table made the design judgment visible.

The System Was Intentionally Simple

The stack wasn't the hard part. The hard part was deciding what the system was allowed to claim.

The control repo paired two things I cared about:

  • Ambient patient presentation
  • Physician-authored decision trace

The decision trace mattered. Training only on downstream chart text teaches the system how the note ended. It does not teach the system how the physician got there.

The first version was deliberately boring:

  • 48 regex-style clinical features
  • Multi-label logistic regression
  • ACS probability
  • Troponin positivity
  • Disposition signal
  • 18-diagnosis Bayesian differential

The feature set was not exotic. It pulled on things ED physicians already care about.

  • Exertional pressure.
  • Radiation to arm, jaw, or back.
  • Diaphoresis.
  • ECG ischemic changes.
  • Troponin positivity.
  • Pleuritic quality.
  • Tachycardia or hypoxia.
  • DVT risk factors.

Not magic. Good. Magic is hard to debug.
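
To make "regex-style clinical features" concrete, here is a rough sketch. The four patterns and the `NEGATION` guard are hypothetical and far cruder than anything that should touch real encounters. They only show the shape: named patterns in, binary indicators out, with a simple check so that "denies sweating" does not count as diaphoresis.

```python
import re

# Hypothetical patterns for illustration; a real set needs clinician review.
FEATURE_PATTERNS = {
    "exertional_pressure": re.compile(
        r"\b(?:pressure|tightness)\b[^.]*\b(?:exertion|walking|stairs)\b", re.I),
    "radiation": re.compile(
        r"\b(?:radiat\w*|spread\w*)\b[^.]*\b(?:arm|jaw|back)\b", re.I),
    "diaphoresis": re.compile(r"\b(?:diaphore\w*|sweat\w*)\b", re.I),
    "pleuritic": re.compile(
        r"\bpleuritic\b|\bworse with (?:inspiration|breathing|a deep breath)\b", re.I),
}

# A negation cue earlier in the same clause (no period or semicolon in between).
NEGATION = re.compile(r"\b(?:no|denies|denied|without)\b[^.;]*$", re.I)

def extract_features(text: str) -> dict[str, int]:
    features = {}
    for name, pattern in FEATURE_PATTERNS.items():
        # Keep only matches that are not preceded by a negation cue.
        hits = [m for m in pattern.finditer(text)
                if not NEGATION.search(text[:m.start()])]
        features[name] = int(bool(hits))
    return features


note = ("Pressure in my chest when walking up stairs, and it spread to my left arm. "
        "Sharp and worse with inspiration. Denies sweating.")
features = extract_features(note)
print(features)
```

A 0 here means "not found or negated," which is exactly the ambiguity the missing-negatives field exists to expose.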

High-level flow:

  1. Ingest structured and semi-structured encounter features
  2. Build text features from transcript and note signal
  3. Add binary clinical indicators
  4. Run a multi-label logistic regression pipeline
  5. Return risk-oriented output with confidence framing
  6. Generate workup cues for clinician review
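
Step 4 is worth a small illustration. In a multi-label setup each diagnosis gets its own logistic score, so the outputs are independent and do not have to sum to one. The sketch below uses plain Python and hypothetical, unfitted weights (`WEIGHTS`) purely to show the mechanics. In scikit-learn the rough equivalent would be a `OneVsRestClassifier` wrapping `LogisticRegression`.

```python
import math

# Hypothetical weights for illustration only; not fitted to any data.
WEIGHTS = {
    "acute coronary syndrome": {
        "bias": -2.0, "exertional_pressure": 1.5, "radiation": 1.0, "pleuritic": -0.5,
    },
    "pulmonary embolism": {
        "bias": -2.5, "pleuritic": 1.2, "dyspnea": 1.0, "tachycardia": 0.8,
    },
}

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_multilabel(features: dict[str, float]) -> dict[str, float]:
    """Score every label independently with its own logistic model."""
    scores = {}
    for label, weights in WEIGHTS.items():
        z = weights["bias"] + sum(
            weights.get(name, 0.0) * value for name, value in features.items()
        )
        scores[label] = sigmoid(z)
    return scores


scores = predict_multilabel({
    "exertional_pressure": 1, "radiation": 1, "pleuritic": 1,
    "dyspnea": 1, "tachycardia": 1,
})
print(scores)
```

With these made-up inputs both labels score at or above 0.5 at the same time, which a single-answer softmax could not express.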

The API contract helped keep the model in its lane. FastAPI was useful because it forced the boring parts to be explicit.

  • Required source fields.
  • Validated feature payload.
  • Predict endpoint.
  • Output schema with evidence, conflicts, missing negatives, and confidence metadata.
  • Logging hooks for reviewer disagreement.

No dramatic diagnosis. No fake certainty. No black-box answer pretending to be clinical judgment.
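
A contract like that can be sketched with standard-library dataclasses. Everything here is an assumption for illustration: the field names, `REQUIRED_FIELDS`, and the `predict` stub are hypothetical. In a FastAPI app these shapes would typically be Pydantic models attached to a POST route, but the sketch avoids third-party imports so it runs as-is.

```python
from dataclasses import asdict, dataclass, field

# Hypothetical required source fields for one encounter.
REQUIRED_FIELDS = ("encounter_id", "transcript_text", "structured_facts")

@dataclass
class DifferentialItem:
    label: str
    likelihood: float
    evidence: list[str]
    uncertainty: str

@dataclass
class PredictResponse:
    differential_support: list[DifferentialItem]
    conflicts: list[str] = field(default_factory=list)
    missing_negatives: list[str] = field(default_factory=list)
    confidence_note: str = ""

def validate_request(payload: dict) -> None:
    """Reject payloads that lack a required source field."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing required source fields: {missing}")

def predict(payload: dict) -> dict:
    validate_request(payload)
    # Placeholder: a real handler would extract features and call the model here.
    response = PredictResponse(
        differential_support=[],
        confidence_note="model not wired in this sketch",
    )
    return asdict(response)


ok = predict({"encounter_id": "demo-1", "transcript_text": "...", "structured_facts": {}})
print(sorted(ok))
```

The response type has no field for a final diagnosis, so the schema itself keeps the model in its lane.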

Why Hybrid Features

Text alone was too loose. Pure structured features were too thin. The hybrid approach made more sense: narrative signal plus key clinical indicators.

Clinicians already think this way. We use the story and the structure. The same risk factor can mean something different depending on the history around it.

A young patient with sharp pain after coughing is different from a patient with exertional pressure and risk factors. The second story can get worse as the questions get better. Nobody in the ED needs that explained. The model has to be forced to respect it.
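
One way to picture the hybrid input, as a sketch only: text-derived features and structured indicators merged into a single named vector, with unknown indicators flagged instead of silently encoded as zero. The `text__` / `ind__` prefixes and the `build_hybrid_features` helper are hypothetical conventions, not the prototype's.

```python
from typing import Optional

def build_hybrid_features(
    text_features: dict[str, int],
    indicators: dict[str, Optional[bool]],
) -> dict[str, float]:
    # Narrative signal, namespaced so its origin stays visible downstream.
    features = {f"text__{name}": float(value) for name, value in text_features.items()}
    for name, value in indicators.items():
        if value is None:
            # Unknown stays unknown: flag it instead of encoding it as 0.
            features[f"ind__{name}__missing"] = 1.0
        else:
            features[f"ind__{name}"] = float(value)
    return features


hybrid = build_hybrid_features(
    text_features={"exertional_pressure": 1, "pleuritic": 1},
    indicators={"tachycardia": True, "hypoxia": False, "dvt_risk_factors": None},
)
print(hybrid)
```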

The Output Had to Be Inspectable

In chest pain, a confident wrong answer is worse than a modest useful one. The output contract mattered more than the model choice.

I wanted four things.

  • Transparent baseline behavior.
  • Multi-label output.
  • Calibrated likelihoods and explicit uncertainty labels where raw scores would overstate certainty.
  • Failure review that could be discussed with clinicians.

The system had to make insufficient evidence visible. Many clinical demos feel wrong to physicians because the answer carries more confidence than the case can support.

A patient with exertional pressure, pleuritic features, and a late dyspnea mention should not get one clean number with no argument attached. The interface had to show competing possibilities, the facts behind each one, and the missing negatives that would change the next question. If the tool could not do that, I did not care how clean the score looked.
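
As a sketch of what "no clean number without an argument" could mean in practice: the hypothetical `frame_output` helper below ranks competing candidates, attaches their evidence, and labels uncertainty from the conflicts and missing negatives. The likelihoods are assumed to be already calibrated, and the label wording and precedence rules are my illustration.

```python
def frame_output(candidates, conflicts, missing_negatives):
    """candidates: list of (label, likelihood, evidence) tuples."""
    if conflicts:
        uncertainty = "conflicting history"
        next_question = f"Resolve: {conflicts[0]}"
    elif missing_negatives:
        uncertainty = "incomplete history"
        next_question = f"Ask about {missing_negatives[0]}"
    else:
        uncertainty = "history consistent"
        next_question = None
    # Highest likelihood first, but every candidate stays in the output.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    return {
        "differential_support": [
            {"label": label, "likelihood": round(p, 2),
             "evidence": evidence, "uncertainty": uncertainty}
            for label, p, evidence in ranked
        ],
        "conflicts": conflicts,
        "missing_negatives": missing_negatives,
        "next_question": next_question,
    }


framed = frame_output(
    candidates=[
        ("acute coronary syndrome", 0.35, ["exertional pressure"]),
        ("pulmonary embolism", 0.40, ["pleuritic features", "late dyspnea mention"]),
    ],
    conflicts=["structured field says no dyspnea; transcript reports dyspnea"],
    missing_negatives=["leg swelling", "hemoptysis"],
)
print(framed["next_question"])
```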

That also changed the evaluation. I would not start with a leaderboard. I would start with reviewer questions:

  • Did the tool preserve the thing that made the case scary?
  • Did it expose contradiction instead of smoothing it away?
  • Did it make the missing history easier to see?
  • Did the output help the physician ask a better next question?

If the answer is no, the model score does not matter much. The better bar is whether the tool helps the physician think before certainty is available.

Where It Failed

The useful threshold is not whether the prototype feels impressive. It is whether a skeptical clinician can see why the tool is worried, what evidence is missing, and where the tool is uncertain.

The early reviewer pass made the weakness obvious. Three board-certified EM physicians reviewed the same five case fixtures. Fifteen reviews total. Each reviewer saw the same case text, structured feature summary, ranked output, conflicts, and missing negatives. They compared the model output against the case signal and physician-authored decision trace.

The system over-called ACS probability by 33.5 percentage points on average. The worst case was a 73.9-point over-call. Reviewers flagged dangerous-miss findings in 6 of 15 reviews (40%). PE was the most common one. A dangerous miss meant the tool failed to keep a serious diagnosis visible enough for clinician review.
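
For readers who want the arithmetic behind numbers like these, here is a sketch of how such a summary could be computed. It assumes over-call is the model's ACS probability minus a reviewer reference estimate, both in percent. The per-case values below are placeholders for illustration and are not the review data. Only the 6-of-15 count comes from the review.

```python
from statistics import mean

def summarize_review(model_pct, reference_pct, flagged_reviews, total_reviews):
    # Signed difference per case, in percentage points (positive = over-call).
    diffs = [m - r for m, r in zip(model_pct, reference_pct)]
    return {
        "mean_overcall_points": mean(diffs),
        "worst_overcall_points": max(diffs),
        "dangerous_miss_rate": flagged_reviews / total_reviews,
    }


# Placeholder ACS probabilities (percent); NOT the actual fixture values.
summary = summarize_review(
    model_pct=[90, 60, 30],
    reference_pct=[50, 40, 30],
    flagged_reviews=6,     # from the review: 6 of 15 reviews flagged
    total_reviews=15,
)
print(summary)
```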

This is the kind of result a prototype should surface before anyone gets seduced by a clean demo. My takeaway was not "the model works." My takeaway was narrower: the extraction layer can be useful while the synthesis layer still needs adult supervision.

Multi-label framing fits ED reasoning better than forced single-label output. Hybrid features are a practical bridge between narrative data and clinically stable signal. Clean APIs matter early if the work might move beyond a toy demo.

But the real bar is simpler: can this help a clinician think better before certainty is available? If I took it further, I would start with cases where the history changes as the interview improves. Weak clinical AI gets exposed there.