Can We Trust AI with Our Health Notes? Cracking the Code on LLM Safety
Hey there! So, you’ve probably heard a ton about Large Language Models (LLMs) – those brainy AI systems that can chat, write, and even code. One of the really exciting places they’re starting to pop up is in healthcare. Imagine if your doctor, instead of spending ages typing up notes after your chat, could have an LLM whip up a summary in a flash. Sounds pretty neat, right? It could free up so much time for actual patient care and maybe even cut down on that dreaded clinician burnout we hear so much about. We’re talking about a real boost to workflow efficiency!
But here’s the rub: when it comes to medical stuff, accuracy isn’t just nice to have; it’s absolutely critical. If an LLM summary gets things wrong, it could lead to miscommunication or dodgy diagnoses, or, worse still, compromise patient safety. That’s a scary thought, and it’s why my colleagues and I decided to roll up our sleeves and get to work on this.
The Sneaky Gremlins: Hallucinations and Omissions
LLMs, for all their smarts, have a couple of tricky habits. Sometimes, they “hallucinate” – which, in AI-speak, means they just make stuff up that wasn’t in the original information. Other times, they have “omissions,” meaning they leave out important details. You can see how either of these could be a disaster in a clinical note. A made-up symptom? A missed allergy? No, thank you!
The problem of hallucinations is a biggie. Some folks think it might even be an unavoidable quirk of how LLMs work. And while there’s a lot of research going into spotting and fixing these errors in general, we realized there wasn’t a clear picture of how often they happen in a clinical setting, why they happen, or what the real-world impact on patient safety could be. Even human-written notes aren’t perfect; studies show they average at least one error and four omissions per note! So, the bar is there, but we need to be super careful if AI is going to help, not hinder.
Our Game Plan: A Framework for Trust
So, what did we do? We came up with a multi-part framework to really dig into this. Think of it as a safety checklist and improvement toolkit for LLMs in medical summarisation. It has four key bits:
- An Error Taxonomy: This is basically a way to classify the different kinds of boo-boos an LLM can make. We didn’t just want to say “it’s wrong”; we wanted to know how it was wrong. For example, we broke down hallucinations into types like ‘fabrication’ (totally made up), ‘negation’ (saying the opposite of what’s true), ‘causality’ (guessing a cause without evidence), and ‘contextual’ (mixing up unrelated topics).
- An Experimental Structure: We set up a system to run experiments, change one thing at a time (like the instructions we give the LLM, known as “prompts”), and see what happens. Iteration is key!
- A Clinical Safety Framework: This is where we put on our doctor hats and assess how harmful an error could be. Is it a ‘minor’ hiccup, or a ‘major’ problem that could change a diagnosis or treatment plan if nobody caught it?
- CREOLA (Clinical Review of LLMs and AI): We built a cool graphical user interface to help us do all this. It’s a platform where clinicians can look at the LLM’s summaries, compare them to the original transcripts, and flag any errors. We named it in honor of Creola Katherine Johnson, one of NASA’s pioneering “human computers” – because just like she was vital for safe space missions, clinicians are vital for safely bringing AI into healthcare.
Putting LLMs to the Test: The Nitty-Gritty
We didn’t just theorize; we got our hands dirty! We ran a series of 18 experiments. For each one, we took 25 primary care consultation transcripts (real, anonymized doctor-patient conversations) and had an LLM (specifically, GPT-4, a well-known model) generate clinical notes. This gave us 450 pairs of transcripts and notes to scrutinize.
And scrutinize we did! We’re talking about a whopping 12,999 sentences in those clinical notes, all manually evaluated by a team of 50 medical doctors. Each note sentence was checked: Was it backed up by the transcript? If not, that’s a hallucination. We also looked at the transcript: Was there crucial clinical info that the LLM missed in its summary? If so, that’s an omission. And for every error, we asked: “Could this mess with the patient’s diagnosis or management?” That determined if it was ‘major’ or ‘minor’. If our two clinician reviewers disagreed, a senior clinician with over 20 years of experience made the final call. Talk about thorough!
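To make that review logic concrete, here’s a minimal sketch of how a sentence-level annotation with two reviewers and a senior tie-break could be represented. The field names and the `resolve` helper are illustrative stand-ins, not CREOLA’s actual data model.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentenceVerdict:
    """One clinician's verdict on a single note sentence (illustrative schema)."""
    sentence: str
    is_hallucination: bool          # True if the sentence is not supported by the transcript
    severity: Optional[str] = None  # "major" or "minor" when an error is flagged

def resolve(reviewer_a: SentenceVerdict,
            reviewer_b: SentenceVerdict,
            senior: SentenceVerdict) -> SentenceVerdict:
    """Keep the reviewers' verdict when they agree; otherwise the senior clinician decides."""
    if (reviewer_a.is_hallucination, reviewer_a.severity) == \
       (reviewer_b.is_hallucination, reviewer_b.severity):
        return reviewer_a
    return senior
```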
What We Found: The Good, The Bad, and The Improvable
Across all those sentences, we found a 1.47% hallucination rate. That means about 1 or 2 sentences out of every 100 were made up or incorrect. Of these, a concerning 44% were ‘major’ – yikes! We also saw a 3.45% omission rate, with 16.7% of those being ‘major’.
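To put those percentages in absolute terms, here’s a rough back-of-envelope calculation over the 12,999 evaluated note sentences. (The omission rate is judged against what the transcript contains rather than against note sentences, so I’ll only convert the hallucination figures here.)

```python
total_note_sentences = 12_999

hallucinated = round(total_note_sentences * 0.0147)  # ~191 sentences unsupported by the transcript
major_hallucinated = round(hallucinated * 0.44)      # ~84 of those judged 'major'

print(hallucinated, major_hallucinated)              # roughly 191 and 84
```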
Where did these gremlins like to hide? Major hallucinations popped up most often in the ‘Plan’ section of the notes (21% of them). This is super important because the ‘Plan’ section often has direct instructions for colleagues or patients. Fabrications were the most common type of hallucination. Negations (where the LLM says the opposite of what’s true) were also a big worry, making up 30% of all hallucinations and often appearing in that critical ‘Plan’ section.
Now, for some good news! Our framework wasn’t just about finding errors; it was about fixing them. By carefully tweaking our prompts and workflows, we saw some fantastic improvements. For example, just by refining the “style” instructions we gave the LLM (Experiment 1 vs. Experiment 8), we managed to slash major omissions and keep hallucinations mostly minor. In another set of experiments (Experiment 3 vs. Experiment 8), changing the prompt structure cut major hallucinations by 75% (from 4 down to 1 in 25 notes) and major omissions by 58%!
Interestingly, some things we thought would help actually made things a bit worse initially. For instance, a “chain-of-thought” approach, where we had the LLM extract facts *before* writing the note (Experiment 5), led to an increase in major hallucinations and omissions compared to our baseline. This was a super valuable lesson: you can’t just assume a fancy technique will work better; you *have* to test it rigorously, especially in medicine.
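For intuition, that chain-of-thought variant was essentially a two-step pipeline: one call to pull out the facts, a second call to write the note from those facts. The prompt wording and the `call_llm` helper below are hypothetical stand-ins for illustration, not the exact prompts from Experiment 5.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whichever chat-completion client you use."""
    raise NotImplementedError

def chain_of_thought_note(transcript: str) -> str:
    # Step 1: extract clinically relevant facts from the consultation.
    facts = call_llm(
        "List every clinically relevant fact stated in this consultation transcript, "
        "one per line, without adding anything that was not said:\n\n" + transcript
    )
    # Step 2: write the note strictly from the extracted facts.
    return call_llm(
        "Write a clinical note using only the facts below. "
        "Do not introduce information that is not listed:\n\n" + facts
    )
```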
We also played around with getting the LLM to output notes in a structured JSON format, which is great for plugging into electronic health records. Through several iterations (Experiments 6, 9, 10, and 11), by refining prompts about subheadings and writing style, we completely eliminated major omissions (from 61 down to 0!) and cut total hallucinations by 25%.
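To give a feel for what structured JSON output means in practice, here’s an illustrative note layout built and serialized in Python. The subheadings and content are made up for this example; the actual schema from those experiments isn’t reproduced here.

```python
import json

# Illustrative structured note; field names and content are invented for this example.
note = {
    "presenting_complaint": "Three days of cough and fever.",
    "history": "No relevant past medical history discussed.",
    "examination": "Not examined (telephone consultation).",
    "plan": [
        "Safety-netting advice given.",
        "Review in 48 hours if symptoms worsen.",
    ],
}

print(json.dumps(note, indent=2))  # ready to hand to an electronic health record integration
```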
Beating Human Error Rates? It’s Possible!
Here’s the really exciting part: in our best-performing experiments (like Experiment 8 with 1 major hallucination and 10 major omissions, and Experiment 11 with 2 major hallucinations and 0 major omissions, across 25 notes), the LLM actually made fewer errors per note than what’s typically reported for human-written notes (which, as I mentioned, average about 1 error and 4 omissions per note). That’s huge! It suggests that with careful engineering and validation, LLMs *can* achieve state-of-the-art, even better-than-human, error rates for clinical documentation.
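To make that comparison explicit, here’s the per-note arithmetic behind the claim. Note that it sets major LLM errors against the overall human figures quoted above, so treat it as indicative rather than a like-for-like benchmark.

```python
notes = 25

exp8_major_per_note = (1 + 10) / notes   # 0.44 major errors per note (1 hallucination + 10 omissions)
exp11_major_per_note = (2 + 0) / notes   # 0.08 major errors per note
human_errors_per_note = 1 + 4            # ~1 error + 4 omissions reported for human-written notes

print(exp8_major_per_note, exp11_major_per_note, human_errors_per_note)
```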
CREOLA: Our Trusty Sidekick
I’ve mentioned CREOLA, our in-house platform, a few times. It was absolutely essential for this work. It’s designed to let clinicians easily identify and label those pesky hallucinations and omissions. Think of it as a safe sandbox where we can try out new LLM architectures or prompt strategies without any risk to actual patients. If an iteration makes things worse (like our chain-of-thought experiment initially did), we catch it in CREOLA before it ever gets near a real clinical setting. While we built CREOLA ourselves, there are publicly available platforms that could be used for similar tasks.
So, What’s the Big Picture?
Our study really drives home that while omissions might be more common than hallucinations (we saw 3.45% vs. 1.47%), hallucinations are much more likely to be ‘major’ errors (44% vs. 16.7%). This means hallucinations pack a bigger punch in terms of potential downstream harm if they’re not caught. And remember, they often lurked in the ‘Plan’ section – a critical area for patient safety.
It’s also clear that you can’t just throw an LLM at medical text and hope for the best. The way you prompt it, the workflow you design, it all matters. Our iterative approach, focusing on the clinical impact of errors, allowed us to make LLM outputs safer and more reliable.
A Few Caveats and What’s Next
Now, we’re scientists, so we always have to mention the limitations. Our sample size of medical transcripts was chosen to balance annotation effort with the number of experiments, but larger studies are always better. We also only tested one LLM (GPT-4). The world of LLMs is moving at lightning speed, with amazing open-source models coming out, and techniques like Retrieval-Augmented Generation (RAG) or Chain of Thought (CoT) are showing a lot of promise for making LLMs even smarter and more grounded in facts. So, a natural next step is to use our framework to test these newer models and methods.
Also, having humans review every single LLM output isn’t sustainable in the long run. It’s expensive and time-consuming. We’re really interested in the idea of “LLM-as-a-Judge” – using one LLM to check the work of another. This could help screen outputs, with clinicians then supervising and spot-checking. It’s a way to scale up this kind of safety assessment while keeping humans in the loop.
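As a flavour of what LLM-as-a-Judge could look like, here’s a minimal sketch in which a second model is asked whether each note sentence is supported by the transcript. The judge prompt, and the choice of GPT-4 as the judge, are assumptions for illustration; this is not something we evaluated in the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is configured in the environment

def judge_sentence(transcript: str, sentence: str) -> str:
    """Ask a judge model whether a note sentence is supported by the transcript."""
    response = client.chat.completions.create(
        model="gpt-4",   # judge model chosen purely for illustration
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You check clinical note sentences against consultation transcripts. "
                        "Answer SUPPORTED or UNSUPPORTED, then give a one-line reason."},
            {"role": "user",
             "content": f"Transcript:\n{transcript}\n\nNote sentence:\n{sentence}"},
        ],
    )
    return response.choices[0].message.content
```

Clinicians would then spot-check the judge’s flags rather than reading every sentence themselves.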
A Framework for the Future of AI in Medicine
Ultimately, we believe our framework – combining that error taxonomy, experimental structure, clinical safety assessment, and a platform like CREOLA – offers a solid template for any organization looking to use LLMs for clinical documentation. It’s about making sure these powerful tools are implemented safely and effectively.
The key, as we see it, is keeping clinicians at the heart of this process. Their expertise is irreplaceable when it comes to spotting errors that could impact patient care. By empowering clinicians to be key stakeholders in how LLMs are deployed, we can all work towards a future where AI genuinely supports healthcare providers, reduces their administrative burden, and helps them deliver even better care. And that’s a future I’m definitely excited about!
A Bit More on How We Did It (The Methodology)
For those who love the details, here’s a little more on our approach:
- Error Taxonomy Deep Dive: We classified hallucinations into four types (a small code sketch of these categories follows this list):
  - Fabrication: Information not in the text.
  - Negation: Contradicting a clinically relevant fact.
  - Causality: Speculating on causes without textual support.
  - Contextual: Mixing unrelated topics.
  Omissions were categorized by what was missed:
  - Current issues: Details about the present condition.
  - PMFS: Past medical, family, and social history, plus medications.
  - Information and plan: Discussions, explanations, and management plans.
- Experimental Design: We were systematic. Each experiment varied specific parameters: the LLM prompt, the workflow (e.g., adding a revision step by the LLM), or even the note’s author, comparing LLM notes to those written by our own clinicians! We always compared against a ‘baseline’ experiment, changing only one thing at a time to clearly see its effect. We used OpenAI’s GPT-4 (GPT-4-32k-0613) with consistent settings (seed 210, temperature 0, top-p 0.95) for reproducibility; a sketch of one such generation call follows this list.
- Clinical Safety Assessment: This was crucial. We didn’t just count errors; we assessed their potential harm. Inspired by medical device certification protocols, we estimated the likelihood of an error and its potential impact on clinical outcomes, which together gave us a risk score. For instance, ‘Very High’ likelihood meant errors were common (>90%), while ‘Very Low’ meant rare (<1%). An illustrative likelihood-by-impact lookup follows this list.
- CREOLA in Action: Our platform, CREOLA, was built as a Streamlit web app. It showed clinicians the LLM-generated note alongside the original transcript and, to make review easier, even highlighted the closest matching sentences between the two documents. A stripped-down sketch of that side-by-side idea also follows this list.
- Recruiting Our Annotators: We relied on volunteer doctors, compensated for their time. They received training (initially one-on-one, later an online course with a questionnaire) to ensure everyone was on the same page about how to identify and classify errors.
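Picking up the error-taxonomy bullet above, here’s a minimal sketch of how those categories could be encoded for annotation tooling. The enum and dataclass names are mine for illustration, not CREOLA’s internal data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class HallucinationType(Enum):
    FABRICATION = "fabrication"  # information not in the transcript
    NEGATION = "negation"        # contradicts a clinically relevant fact
    CAUSALITY = "causality"      # speculates on a cause without textual support
    CONTEXTUAL = "contextual"    # mixes unrelated topics

class OmissionCategory(Enum):
    CURRENT_ISSUES = "current issues"              # details about the present condition
    PMFS = "pmfs"                                  # past medical, family, social history, medications
    INFORMATION_AND_PLAN = "information and plan"  # discussions, explanations, management plans

class Severity(Enum):
    MINOR = "minor"
    MAJOR = "major"  # could alter diagnosis or management if it slipped through

@dataclass
class LabelledError:
    text: str
    severity: Severity
    hallucination: Optional[HallucinationType] = None  # set when the error is a hallucination
    omission: Optional[OmissionCategory] = None        # set when the error is an omission
```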
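For the experimental-design bullet, this is roughly what a single note-generation call looks like with the decoding settings listed above. The system prompt is a placeholder, since the prompt itself was the thing each experiment varied.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is configured in the environment

def generate_note(transcript: str, system_prompt: str) -> str:
    """One note-generation call with the fixed decoding settings used across experiments."""
    response = client.chat.completions.create(
        model="gpt-4-32k-0613",
        seed=210,        # fixed seed for reproducibility
        temperature=0,   # deterministic-as-possible decoding
        top_p=0.95,
        messages=[
            {"role": "system", "content": system_prompt},  # the prompt variant under test
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```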
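For the safety-assessment bullet, here’s an illustrative likelihood-by-impact scoring in the spirit of that approach. The bands and the multiplication are my simplification; the actual matrix follows medical device risk conventions and isn’t reproduced here.

```python
# Illustrative risk scoring: likelihood band x clinical impact -> priority score.
LIKELIHOOD_BANDS = ["very low", "low", "medium", "high", "very high"]  # <1% ... >90% of outputs
IMPACT_LEVELS = ["negligible", "minor", "major", "catastrophic"]

def risk_score(likelihood: str, impact: str) -> int:
    """Higher score = higher priority for mitigation before deployment (illustrative only)."""
    return (LIKELIHOOD_BANDS.index(likelihood) + 1) * (IMPACT_LEVELS.index(impact) + 1)

print(risk_score("very high", "major"))  # 5 * 3 = 15
print(risk_score("very low", "minor"))   # 1 * 2 = 2
```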
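And for the CREOLA bullet, here’s a stripped-down Streamlit sketch of the side-by-side review idea, using difflib to find the closest matching transcript sentence. It’s a toy reimplementation of the concept, not CREOLA’s actual code.

```python
import difflib
import streamlit as st

def closest_match(sentence: str, candidates: list[str]) -> str:
    """Return the transcript sentence most similar to a given note sentence."""
    return max(candidates,
               key=lambda c: difflib.SequenceMatcher(None, sentence, c).ratio(),
               default="(no match found)")

st.title("Clinical note review (toy CREOLA-style sketch)")

transcript = st.text_area("Consultation transcript")
note = st.text_area("LLM-generated note")

if transcript and note:
    transcript_sents = [s.strip() for s in transcript.split(".") if s.strip()]
    left, right = st.columns(2)
    for sent in (s.strip() for s in note.split(".") if s.strip()):
        left.write(sent)                                    # note sentence under review
        right.write(closest_match(sent, transcript_sents))  # nearest transcript evidence
    st.radio("Verdict for the highlighted sentence",
             ["supported", "hallucination (minor)", "hallucination (major)"])
```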
This rigorous, clinician-centered approach is what allowed us to not only measure the problem but also to demonstrate significant improvements, paving the way for safer and more reliable use of LLMs in the clinic.
Source: Springer