LLMs Tackle Clinical Trial Protocol Deviations: A Game Changer

Hey There, Let’s Talk Clinical Trials!

You know, working in clinical development is pretty fascinating. We’re constantly pushing boundaries to find new treatments and improve patient lives. But let me tell you, it comes with its fair share of challenges. One big one? Keeping track of something called Protocol Deviations, or PDs for short.

Now, what exactly is a PD? Well, the official definition from the folks at ICH (that’s the International Council for Harmonisation, sounds important, right?) is “any change, divergence, or departure from the study design or procedures defined in the protocol.” Simple enough, you’d think. But here’s the rub: how people interpret and handle these deviations can vary wildly across different institutions, sponsors, investigators, and review boards. It’s a bit of a messy area, honestly.

The Headache of Manual PD Management

We’re supposed to follow the protocol to the letter – it’s all about upholding ethical standards, ensuring scientific quality, and protecting the folks participating in the trials. Good Clinical Practice (GCP) guidelines are super clear: follow the rules, report deviations, and trend them, all to maintain data integrity, protect patient safety, and stay on the right side of the regulators. Identifying and reporting PDs in a timely manner is crucial for mitigating risks and keeping our study data reliable.

Despite initiatives to make things clearer, like TransCelerate’s efforts, systematically identifying and trending the *impactful* PDs – the ones that really matter for data reliability or subject safety – has been limited. Why? Because managing PDs is still largely a manual process. We’re talking about sifting through tons of unstructured, non-standardized text descriptions. It’s inconsistent, prone to errors, and makes it tough to spot systemic issues across a study or even multiple studies. Standardizing this classification is essential, but applying it consistently? That’s the hard part.

These free-text descriptions are full of valuable info about things like informed consent, who should or shouldn’t be in the study (inclusion/exclusion criteria), the treatments given, and medications that weren’t allowed. But traditional methods, like the Natural Language Processing (NLP) techniques explored by some clever folks, often require a ton of upfront work. Think extensive feature engineering and training models on massive datasets that have been manually labeled – a seriously time-consuming and resource-intensive task.

Enter the LLMs: A Glimmer of Hope

But then, something exciting happened. The rise of Large Language Models (LLMs) started revolutionizing how we handle text. These models are far more accurate and nuanced, and they understand context in a way that traditional NLP often struggled with. We saw this as a potential game-changer for our PD problem. What if we could use an LLM to automatically classify these messy, free-text PD descriptions?

We figured there had to be a better way than months of manual analysis. We wanted an automated solution that was efficient, flexible, and could specifically target the kinds of PDs that were most critical. So, we decided to try something new.

Our bright idea? Leverage an LLM, specifically Meta’s Llama 2, with a specially crafted prompt to classify the free-text PDs coming straight out of our PD management system here at Roche. Our main goal was to identify those PDs that could potentially mess with how we assess disease progression in our clinical programs. This is super important because issues here can impact study endpoints and, most importantly, patient safety.

Currently, checking for these issues involves manual, siloed reviews. Different teams look at different aspects at different times, often missing the big picture or how things connect. This fragmented approach increases the risk of undetected PDs that could compromise data integrity or impact key study analyses.

So, we set out to address this by:

  • Using the LLM to label PDs that directly or potentially affect the timeliness or accuracy of disease progression assessments.
  • Integrating these labeled PDs with other relevant clinical data.
  • Presenting everything through easy-to-understand visualizations to help our expert reviewers investigate efficiently.

And because we’re dealing with sensitive clinical data, we made sure everything was handled securely within our environment, with patient data anonymized and strict adherence to our data ethics principles.

Building Our LLM Solution: Prompts and Process

Rather than using the LLM as a chat assistant, we adapted Llama 2 for large-scale classification. Our approach involved a few key steps:

  1. Prompt Strategy Development: Figuring out exactly what we needed the LLM to do and how to ask it effectively. This involved considering things like how LLMs process text (tokens!), their context window limitations (can’t feed them *everything*), and how to manage randomness in their responses using parameters like Temperature and Top P. For classification, especially when we wanted to catch *most* potential issues, we leaned towards settings that allowed a bit more flexibility rather than being strictly deterministic.
  2. Prompt Template Development: We started with a basic template – telling the model its role, giving instructions, defining criteria, and showing it the text to classify. We manually tested this template with a small sample, refining it iteratively by adjusting wording and parameters.
  3. Batch Classification Process: Once we had promising templates, we automated the process. We’d pull PDs from the database, format them into prompts, feed batches to the LLM, collect the classifications (DP, Maybe DP, No DP), and store the results. A minimal code sketch of this loop, including the sampling parameters from step 1, follows this list.
  4. Prompt Evaluation: This was crucial. We needed to see how well our prompts actually performed. We built a reliable dataset by having two senior quality experts independently review a sample of PDs and reconcile any differences to create a ‘ground truth’. Then we tested our different prompt versions against this ground truth using standard metrics: Precision and Recall.
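
To make the batch step in point 3 concrete, here’s a minimal Python sketch of what such a loop can look like. It’s illustrative only: the endpoint URL, the response format, the prompt wrapper, and the sampling values (the Temperature and Top P mentioned in step 1) are assumptions standing in for our internal setup, not the actual implementation.

```python
import requests  # calling a self-hosted Llama 2 inference service (hypothetical endpoint)

LLAMA2_URL = "http://localhost:8080/generate"  # assumption: an internal, secured inference service
LABELS = ("Maybe DP", "No DP", "DP")           # classification scheme described above

def build_prompt(pd_text: str) -> str:
    """Wrap one free-text protocol deviation in the classification prompt.

    A fuller zero-shot template is sketched later in the post."""
    return (
        "You are a clinical quality reviewer. Classify the protocol deviation below.\n"
        "Answer with exactly one label: DP, Maybe DP, or No DP.\n\n"
        f"Protocol deviation: {pd_text}\nLabel:"
    )

def classify_pd(pd_text: str) -> str:
    """Send one prompt to the model and map the reply onto a known label."""
    payload = {
        "prompt": build_prompt(pd_text),
        "temperature": 0.3,   # some flexibility, but far from fully random
        "top_p": 0.9,         # nucleus sampling cap
        "max_new_tokens": 8,  # the answer is just a short label
    }
    # Assumption: the service returns JSON shaped like {"text": "..."}.
    reply = requests.post(LLAMA2_URL, json=payload, timeout=60).json()["text"].strip()
    return next((label for label in LABELS if reply.startswith(label)), "Unparsed")

def classify_batch(pd_records: list[dict]) -> list[dict]:
    """Classify a batch of PDs pulled from the PD management system."""
    return [{"pd_id": r["pd_id"], "label": classify_pd(r["description"])} for r in pd_records]
```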

Let’s talk about Precision and Recall for a second. Precision tells you how many of the things the model *said* were relevant actually *were* relevant. Recall tells you how many of the things that *were* relevant the model actually *caught*. In our case, identifying PDs that impact disease progression is like finding needles in a haystack – they’re infrequent but high-impact. Missing one could be a big deal. So, we deliberately prioritized Recall. We wanted the model to flag as many potential cases as possible, even if it meant flagging some false positives, because the burden of reviewing a few extra records is far less than the risk of missing a critical one.
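
To make that trade-off tangible, here’s a toy sketch that computes both metrics against an expert ‘ground truth’, treating both “DP” and “Maybe DP” as “flag for review”. The data is invented purely for illustration; it just shows how a false positive costs precision while a missed case costs recall.

```python
def precision_recall(y_true: list[bool], y_pred: list[bool]) -> tuple[float, float]:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum(p and not t for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

POSITIVE = {"DP", "Maybe DP"}  # anything the model flags for expert review
ground_truth = ["DP", "No DP", "Maybe DP", "No DP", "DP"]        # expert-reconciled labels (invented)
predictions  = ["DP", "Maybe DP", "Maybe DP", "No DP", "No DP"]  # model output (invented)

y_true = [label in POSITIVE for label in ground_truth]
y_pred = [label in POSITIVE for label in predictions]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # one false positive, one miss -> 0.67 / 0.67
```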

We tested three different prompt versions. Interestingly, the simplest one, a ‘zero-shot’ version that didn’t include specific examples in the prompt, achieved the highest recall – over 80%! This version included clear instructions, classification criteria, and reinforcement, telling the model exactly what to focus on and what format to use for the output. It was refined for solid tumor studies using specific criteria (RECIST 1.1), but the cool part is it can be adapted for other types of studies or even other classification tasks.
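
For a feel of what such a zero-shot prompt can look like, here’s an illustrative reconstruction built from the ingredients listed above (role, instructions, criteria, reinforcement, output format). The wording and criteria below are assumptions made for the example; they’re not the production prompt we actually ran.

```python
# Illustrative zero-shot template only; the role, criteria, and wording are assumptions
# modelled on the ingredients described above, not the production prompt.
ZERO_SHOT_TEMPLATE = """\
You are a clinical quality expert reviewing protocol deviations from an oncology
study that uses RECIST 1.1 to assess disease progression.

Task: decide whether the protocol deviation below could directly or potentially
affect the timeliness or accuracy of the disease progression assessment.

Consider, for example:
- missed, late, or incomplete tumour assessments or imaging;
- errors in applying RECIST 1.1 (wrong target lesions, wrong measurements);
- anything that delays confirming or reporting progression.

Respond with exactly one label on a single line: DP, Maybe DP, or No DP.
Do not add any other text.

Protocol deviation:
{pd_text}
"""

prompt = ZERO_SHOT_TEMPLATE.format(pd_text="Week 16 CT scan performed 21 days outside the window.")
```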

The Results: Faster Insights, Better Oversight

So, what did this LLM magic get us? It flagged over 80% of the PDs that our experts agreed potentially affected disease progression assessment. Think about that: identifying these critical issues went from potentially months of painstaking manual review to getting actionable insights in minutes! This speed allows our quality professionals to quickly see program-level risks and identify areas that need immediate attention or process improvement.

But it wasn’t just about getting a label. Remember how I mentioned integrating the results? This is where the decision *support* part comes in. We built interactive dashboards using Tableau® to visualize the LLM’s classifications alongside other crucial clinical data. This allows our experts to dynamically explore the results.

For example, they can see the proportion of potentially impactful PDs grouped by category (like Procedure, Medication, etc.). They can trend these types of PDs over time across different studies. They can even assess the timeliness of how these specific PDs were addressed by product, study, or key process step.
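
To give a sense of the data shaping behind those views, here’s a small pandas sketch. The column names and rows are purely hypothetical, and the real dashboards were built in Tableau on top of similarly prepared tables; this just illustrates the kind of aggregation involved.

```python
import pandas as pd

# Hypothetical table of LLM-labelled PDs joined with PD metadata.
pds = pd.DataFrame({
    "study_id":      ["S1", "S1", "S2", "S2", "S2"],
    "category":      ["Procedure", "Medication", "Procedure", "Procedure", "Other"],
    "label":         ["DP", "No DP", "Maybe DP", "DP", "No DP"],
    "reported_date": pd.to_datetime(["2023-01-10", "2023-01-22",
                                     "2023-02-03", "2023-02-15", "2023-03-01"]),
})

flagged = pds[pds["label"].isin(["DP", "Maybe DP"])]

# Share of potentially impactful PDs by category.
by_category = flagged["category"].value_counts(normalize=True).rename("share_of_flagged")

# Monthly count of flagged PDs per study, for trending over time.
trend = (flagged
         .groupby(["study_id", pd.Grouper(key="reported_date", freq="MS")])
         .size()
         .rename("flagged_pds")
         .reset_index())

print(by_category)
print(trend)
```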

One particularly powerful visualization shows timelines for individual patients who might have reached disease progression based on internal calculations but weren’t formally diagnosed by the investigator within a certain timeframe. By putting different clinical events (like assessments, treatments, adverse events) on a timeline for these flagged patients, experts get a quick overview and can drill down into specific cases, using the LLM’s explanation for its classification as a starting point for their investigation.

This integrated approach, combining the LLM’s high-recall labeling with contextual clinical data and intuitive visualizations, is key. It mitigates the burden of reviewing false positives flagged by the high-recall model and addresses the limitations of the old, manual, siloed review process. It’s not the LLM making the final call; it’s a powerful tool enhancing human oversight and making the review process more efficient and comprehensive.

Navigating the Roadblocks: Pharma, AI, and Reality

Now, adopting new tech, especially AI, in a highly regulated environment like clinical development isn’t without its challenges. The pharmaceutical industry can be risk-averse, and integrating AI isn’t just about plugging in a model; it requires aligning human expertise, business needs, and the technology itself.

We learned that setting realistic expectations is crucial. LLMs are fantastic at processing unstructured text and finding patterns, but they aren’t magic bullets or standalone decision-makers. Our task of identifying PDs affecting disease progression, while business-critical, is inherently complex and involves nuance, which is why integrating the LLM’s output into a decision-support framework for expert review is so effective.

One big hurdle is validation. How do you validate an AI system, especially one using LLMs that can be sensitive to prompt phrasing? Existing computer system validation (CSV) frameworks weren’t built for this, and regulatory guidelines are still catching up. Establishing a large, perfectly validated ‘ground truth’ dataset is also incredibly difficult – even our experts had disagreements! This highlights the need for continuous monitoring and robust governance frameworks to address potential biases, ensure fairness, and maintain audit trails.

Another limitation is that the model can only work with the data it’s given. If a PD description is vague or incomplete, the LLM (like a human reviewer) might struggle with interpretation. Subjectivity is inherent in interpreting complex clinical events, and there’s no such thing as a perfect model or a perfect human expert. Prompt sensitivity means small changes in how you ask the question can affect the answer, so testing and monitoring consistency are important.
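
One lightweight way to keep an eye on that is a repeatability check: classify the same description several times and see how often the answer changes. The sketch below reuses the hypothetical classify_pd helper from the earlier batch example; it’s a simple monitoring idea, not a validation procedure.

```python
from collections import Counter

def label_stability(pd_text: str, classify, n_runs: int = 5) -> float:
    """Fraction of runs that agree with the most common label for one PD description."""
    labels = [classify(pd_text) for _ in range(n_runs)]
    return Counter(labels).most_common(1)[0][1] / n_runs

# Usage idea (with classify_pd from the earlier sketch); values well below 1.0 suggest
# the prompt wording or sampling settings need tightening:
# label_stability("Week 16 CT scan performed 21 days outside the window.", classify_pd)
```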

Ultimately, our LLM solution is a technical capability to *support* decision-making, not replace it. The responsibility for reviewing PDs and ensuring patient safety and study integrity still rests firmly with our quality professionals.

Looking Ahead: More Than Just PDs

What we’ve done here is propose and demonstrate a practical way to use LLMs to automatically label important PDs, making trending and analysis much more efficient and targeted. This approach is great for extracting deep insights from that messy, unstructured free text that traditional methods struggle with. It complements existing best practices and provides a flexible solution for classification.

The exciting part? This framework isn’t limited to just classifying disease progression PDs. It can be adapted for classifying other types of clinical research data or tackling similar challenges across clinical development. Imagine using it to quickly identify issues related to informed consent compliance or dosing regimens across thousands of records!

Scaling this up for broader adoption will require continued effort on both the technical and organizational fronts. Technically, we can always improve by using the latest LLMs, refining our prompt engineering techniques, and testing on even larger datasets. Organizationally, it means investing in identifying the most impactful use cases, building clear validation frameworks tailored for AI, and actively addressing compliance and ethical considerations.

This experiment shows the feasibility and potential value of bringing advanced LLMs into the clinical development space. It’s about leveraging these powerful tools to enhance human expertise, streamline complex processes, and ultimately contribute to getting safe and effective treatments to patients faster.

Source: Springer
