When the AI Knows Two Caitlins and Picks the Wrong One

A walkthrough of a subtle patient name matching bug in an AI-powered fax processing pipeline — and why a shared first name is never enough to make a match.

Matt Dennis

A fax came in for a new patient. The document clearly showed her name. The system filed it under someone else entirely.


The name on the document was Caitlin Mercer. The name the system stored was Caitlin Rhodes. Both are real patients. They share a first name. They are not the same person.


This is the story of how that happened, and how we fixed it.


The Pipeline

The fax processing system for this medical practice is a multi-step AWS Lambda pipeline. Incoming faxes land in S3, get run through Textract for OCR, and then get analyzed by a Bedrock model (Amazon Nova) that reads the PDF directly and returns a structured JSON object: document type, summary, findings, and patient name.
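The structured result from that Bedrock step looks roughly like this. A sketch only: the field names and values below are illustrative, not the exact production schema.

```python
import json

# Hypothetical example of the structured JSON the Bedrock (Nova) step returns.
# Field names and values are illustrative, not the production schema.
extraction = {
    "document_type": "referral",
    "summary": "Referral letter for a new patient evaluation.",
    "findings": ["Requesting initial consultation"],
    "patient_name": "Mercer, Caitlin",  # raw, exactly as written on the fax
}

print(json.dumps(extraction, indent=2))
```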


That patient name field is the first half of the problem.


Nova’s job is to extract the name as written on the document. Most of the time it does this well. But raw extraction from a fax isn’t always clean — names get formatted as “Lastname, Firstname,” OCR occasionally garbles a character, and the same patient might appear as “Caitlin M.” on one document and “C. Mercer” on another. Storing the raw extraction as the canonical name produces inconsistency over time.
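For a sense of what that cleanup involves, here is a deterministic sketch of the inversion and punctuation handling (illustrative only; in the real pipeline this normalization happens inside a model prompt, not code):

```python
import re

def normalize_name(raw: str) -> str:
    """Illustrative normalization: undo 'Lastname, Firstname' inversion
    and strip stray punctuation. Not the production implementation."""
    raw = raw.strip()
    if "," in raw:
        last, _, first = raw.partition(",")
        raw = f"{first.strip()} {last.strip()}"
    # Drop punctuation OCR tends to introduce; keep letters, spaces, hyphens, apostrophes
    raw = re.sub(r"[^\w\s\-']", "", raw)
    return " ".join(raw.split())

print(normalize_name("Mercer, Caitlin"))  # -> Caitlin Mercer
```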


So we added a normalization step: the Patient Name Service.


The Patient Name Service

The Patient Name Service is a Make.com scenario that sits between Bedrock’s raw extraction and the final DynamoDB write. It receives the extracted name, fetches the current patient registry from S3, and passes both to Gemini with a carefully engineered prompt.


The prompt uses what we called a Surname Gate algorithm:


  1. Normalize the input — handle “Lastname, Firstname” inversions, strip punctuation
  2. Surname Gate — check if the last name is phonetically similar to any surname in the registry. If not, stop immediately and return the input as-is (new patient)
  3. First name verification — only if the surname gate passed, verify the first name also matches
  4. Return either the canonical registry name (if matched) or the normalized input (if new patient)
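The steps above can be sketched as deterministic code. The production version is a Gemini prompt, not code, and `SIMILARITY_THRESHOLD` and the use of spelling similarity as a stand-in for "phonetically similar" are assumptions for illustration:

```python
import difflib

REGISTRY = ["Caitlin Rhodes", "Daniel Okafor"]  # illustrative registry
SIMILARITY_THRESHOLD = 0.85  # assumed cutoff for "similar"

def split_name(full: str) -> tuple[str, str]:
    parts = full.split()
    return parts[0], parts[-1]

def surname_gate(candidate: str, registry: list[str]) -> str:
    """Steps 2-4 of the algorithm, on an already-normalized input.
    Returns the canonical registry name on a match, else the input (new patient)."""
    first, last = split_name(candidate)
    for canonical in registry:
        reg_first, reg_last = split_name(canonical)
        # Surname Gate: the last name is the anchor; a shared first name alone is never enough
        if difflib.SequenceMatcher(None, last.lower(), reg_last.lower()).ratio() >= SIMILARITY_THRESHOLD:
            # First name verification, only after the gate passes
            if difflib.SequenceMatcher(None, first.lower(), reg_first.lower()).ratio() >= SIMILARITY_THRESHOLD:
                return canonical
    return candidate  # gate never passed: treat as a new patient

print(surname_gate("Caitlin Mercer", REGISTRY))  # -> Caitlin Mercer (no surname match)
print(surname_gate("Caitlin Rodes", REGISTRY))   # -> Caitlin Rhodes (near-exact spelling variant)
```

Note that the gate runs first and short-circuits: the first-name comparison is unreachable for a surname like "Mercer" that matches nothing in the registry.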

The prompt is explicit: “A shared First Name is NOT enough to create a match. The Last Name is your anchor.”


The logic is correct. The model didn’t follow it.


What Went Wrong

Caitlin Mercer is a new patient. “Mercer” is not in the registry. By the rules of the Surname Gate, the algorithm should stop at step 2 and return “Caitlin Mercer.”


Instead, Gemini 2.5 Flash-Lite saw “Caitlin,” which appears in the registry attached to “Rhodes,” and made a leap. The first-name match apparently carried more weight than the explicit instruction that it shouldn’t. The surname gate passed when it shouldn’t have, the first names matched, and the system confidently returned “Caitlin Rhodes.”


From a purely statistical view, the model’s behavior is understandable. The input shared a first name with the only “Caitlin” in a short registry list, and the model’s priors pushed it toward a match. The prompt fought that instinct and lost.


This failure mode is especially common in smaller, faster models. Flash-Lite is optimized for speed and cost; in exchange, it’s more susceptible to exactly this kind of instruction-following degradation when its priors conflict with explicit rules.


The Fix

Two changes:


First, upgrade from Gemini 2.5 Flash-Lite to Gemini 2.5 Flash. In a medical records context, the marginal cost difference is irrelevant. The stronger model follows instructions more reliably when its priors push in a different direction.


Second, add a calibration example to the prompt that covers this exact case:


```
Input: "Caitlin Mercer"
Logic: "Mercer" not in registry surnames. The registry contains "Caitlin Rhodes",
but a shared first name is irrelevant — the Surname Gate did not pass. STOP.
Result: Caitlin Mercer
```

LLM prompts behave like case law: general rules are good, but a specific precedent for the failure mode you’ve already seen is better. The model is less likely to override a rule when you’ve shown it the exact scenario where the rule applies.


The Deeper Issue

This bug is a version of a classic problem in probabilistic systems: you can build the right abstraction and still get the wrong answer if the underlying model’s learned behavior doesn’t map cleanly onto your abstraction.


The Surname Gate is conceptually correct. “Last name is your anchor” is exactly the right heuristic for this use case — first names are common, last names (especially uncommon ones) are discriminating. The prompt articulates this clearly.


But a language model is not executing your algorithm. It’s predicting what token comes next given all the context in the window, including a registry that happens to contain only one “Caitlin.” That prior is strong. A weaker model loses the battle between the prior and the explicit instruction.


The fix — better model, concrete example — works by making the counter-prior stronger than the model’s default pull toward matching. It’s not fixing the algorithm. The algorithm was fine. It’s fixing the gap between the algorithm as written and the algorithm as executed.


What’s in Place Now

The updated scenario uses Flash (not Flash-Lite), adds the calibration example, and tightens the phonetic similarity language to reduce subjectivity. “Phonetically similar” is now defined as “an exact or near-exact spelling variant” rather than relying on the model’s judgment about sound.
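That tightened definition can be made concrete with edit distance. A hedged sketch, where the edit budgets (1 edit, or 2 for longer surnames) are assumptions rather than the prompt's exact wording:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def near_exact_variant(a: str, b: str) -> bool:
    """'Exact or near-exact spelling variant': allow 1 edit,
    or 2 for names of 8+ characters. Thresholds are illustrative."""
    a, b = a.lower(), b.lower()
    budget = 2 if min(len(a), len(b)) >= 8 else 1
    return edit_distance(a, b) <= budget

print(near_exact_variant("Rhodes", "Rodes"))   # True: one deletion apart
print(near_exact_variant("Rhodes", "Mercer"))  # False: not a spelling variant
```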


The fax system also has a manual correction UI — staff can always override the stored name directly in the fax viewer. That’s the safety net. The goal of the name service is to make that override unnecessary in the common case, not to be the last line of defense.


Caitlin Mercer is now in the registry. Her faxes will route correctly going forward. Caitlin Rhodes is unaffected.