Are Your EHRs Up to Snuff? Nailing Data Quality for Super-Smart Joint Models!
Hey there, data enthusiasts and healthcare innovators! Ever feel like we’re sitting on a mountain of gold with all those Electronic Health Records (EHRs), but not quite sure if our digging tools are sharp enough? I know I do! Over the last ten years, doctors and hospitals have been jumping on the EHR bandwagon, which is fantastic. But let’s be real, this data can be a bit… messy. We’re talking about issues with completeness and overall data quality. And when we want to use sophisticated tools, like the joint models we’re chatting about today, it’s a bit of a mystery how much these data quirks really mess things up.
So, what’s the big deal with these joint models? Imagine you’re tracking something in a patient over time, like a biomarker, and you also want to predict a major health event, like a disease diagnosis. Joint models are clever because they look at both the longitudinal data (the biomarker changing over time) and the survival data (when the event happens) all at once. They help us use every scrap of info we have. The big question I wanted to tackle was: how good does our longitudinal EHR data *really* need to be for these joint models to actually be better than our old friend, the Cox model?
Our Grand Investigation: Simulating for Success!
To get to the bottom of this, I embarked on a pretty extensive simulation study. Think of it like building a giant, digital laboratory where we can create and test data under all sorts of conditions. We systematically played around with different aspects of data quality, such as:
- How often measurements were taken (measurement frequency)
- How much ‘random noise’ was in the data
- How different patients’ data looked (heterogeneity)
Then, we threw our joint models at this simulated data and compared their performance against the traditional Cox survival models. What we found was pretty illuminating! One key thing is that if a biomarker is going to change before a disease pops up, those changes need to be fairly consistent among similar groups of patients. And here’s a juicy bit: when the data gets noisier and we have more frequent measurements, that’s when the joint model really starts to flex its muscles and outperform the Cox model.
To show you this isn’t just theory, we looked at a couple of real-world scenarios: how serum bilirubin trajectories relate to the progression of primary biliary cirrhosis (PBC), and how the estimated glomerular filtration rate (eGFR) plays into chronic kidney disease (CKD). It’s all about seeing if these guidelines hold up in the wild!
Why Bother with Early Predictions?
Let’s face it, catching diseases early is a game-changer. If we can spot patients at high risk for a clinical diagnosis sooner rather than later, we can make a real difference. Statistical models are our trusty sidekicks in this mission, helping us predict health conditions based on things like changing biomarker levels. The idea is simple: tiny shifts in these levels might be whispering secrets about our health, hinting at a diagnosis that’s still down the road. Joint models, by combining this unfolding story (longitudinal data) with the eventual outcome (survival data), offer a powerful way to listen to these whispers.
Research has already shown these models can give us a clearer picture and better estimates than just looking at static survival data. But, and it’s a big but, you need good quality data for both parts. EHR data is a treasure trove – lab results, diagnoses, treatments, symptoms – it’s all in there! However, it comes with its own set of headaches, especially when we’re trying to get it ready for analysis. We’re talking missing bits, inconsistencies, and straight-up inaccuracies. For primary care data, it can be even trickier with irregular patterns and figuring out what’s relevant.
Until now, nobody had really systematically checked how joint models cope with the kind of noisy, real-world data we get from EHRs. When do they actually give us an edge in precision compared to something more established like Cox regression? That’s the gap I wanted to fill.
Peeking Under the Hood: Our Simulation Setup
Alright, let’s get a bit geeky – how did we actually cook up this simulated data? My goal was to create realistic primary care and hospital data that we could use to see how data quality affects our disease progression models. Here’s the basic recipe:
- Patient Timelines: We imagined patients entering our study at a starting point (tstart) without the disease we’re interested in. We’d “watch” them for 5 years, collecting both survival and longitudinal data until an end point (tend). After that, we’d follow them for another 5 years, but only for survival data (no more biomarker measurements).
- Data Scaling: We scaled all longitudinal data (like biomarker levels) to fit within a 0 to 1 range. This just makes things comparable.
- Balanced Groups: We kept things even with a 50-50 split between patients who would eventually get the disease and those who would stay healthy.
We generated three main types of data:
- Survival Outcome: Did the patient get diagnosed during the follow-up, or not?
- Longitudinal EHR Data: Those biomarker measurements taken between tstart and tend.
- Baseline Characteristics: Things like sex and age, which can influence survival and differ between cases and controls.
For simplicity, we focused on a single biomarker. The diagnosis time for sick patients was randomly set between 10 and 119 months into the study, meaning there was always some longitudinal data before the event. Healthy patients were “censored” at 120 months (no diagnosis within the study period).
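To make this concrete, here’s a tiny R sketch of how such a cohort skeleton could be set up. This is my illustration of the recipe above, not the study’s actual code, and all the names (n_patients, cohort, scale01, etc.) are made up:

```r
set.seed(42)

n_patients <- 200                                          # total cohort size
is_case    <- rep(c(TRUE, FALSE), each = n_patients / 2)   # 50-50 split: cases vs. controls

# Diagnosis time for cases: somewhere between month 10 and 119;
# controls are censored at 120 months (no diagnosis within the study period).
event_time <- ifelse(is_case, sample(10:119, n_patients, replace = TRUE), 120)

cohort <- data.frame(
  id     = seq_len(n_patients),
  case   = is_case,
  time   = event_time,               # months until diagnosis or censoring
  status = as.integer(is_case),      # 1 = diagnosed, 0 = censored
  age    = round(rnorm(n_patients, mean = 60, sd = 10)),
  sex    = sample(c("f", "m"), n_patients, replace = TRUE)
)

# Min-max scaling, used later to put the longitudinal biomarker on a [0, 1] range
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
```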
The Nitty-Gritty of Data Generation
With biobanks and EHR data becoming more common, we can build better predictive models. But these datasets are often messy – varying numbers of measurements, different levels of precision. That’s why realistic simulation is key.
In the real world, biomarker measurements aren’t usually taken on a neat schedule. So, we simulated both the frequency of measurements and the timing. The number of measurements for each patient was drawn from a normal distribution, reflecting real-world variability. Interestingly, we found that in the UK Biobank, a typical biomarker might only be measured once every two years on average, with a standard deviation of 2. If you filter for patients with at least 2-3 measurements (because we need longitudinal data!), the average goes up, but your sample size shrinks.
The actual measurement time points were random integers within the 60-month observation window. The core of our simulation was generating the longitudinal EHR data, y(tij), before any disease onset. We used a linear mixed-effects model, a popular choice for tracking changes over time. Essentially, we assumed that for a patient developing a disease, their biomarker level would be constant until a certain “breakpoint” (tm), after which it would start to change (increase, in our setup) linearly, hinting at the upcoming diagnosis. For healthy patients, the slope was assumed to be zero.
Not every patient who’s going to get sick will show these early biomarker changes. So, we introduced a “response probability” (presp). We also allowed for differences in baseline biomarker levels between those who would get sick and those who wouldn’t. And, of course, we added some random “noise” to each measurement, because real-world measurements are never perfectly precise. This setup allowed us to tweak various data quality parameters and see what happens.
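Here’s a hedged sketch of that trajectory generator, assuming the flat-then-rising (“breakpoint”) shape just described. The parameter names echo the text (t_m, p_resp, sigma_e, delta_b), but the default values and the function itself are purely illustrative:

```r
simulate_biomarker <- function(event_time, is_case,
                               n_meas_mean = 5, n_meas_sd = 2,   # measurements per patient
                               t_obs    = 60,      # months of longitudinal observation
                               t_m      = 24,      # months of rising slope before diagnosis
                               slope    = 0.01, slope_sd = 0.005, # mean slope and its SD (per month)
                               p_resp   = 0.8,     # probability a case shows the biomarker response
                               delta_b  = 0.1,     # baseline (intercept) difference, cases vs. controls
                               sigma_e  = 0.075) { # measurement noise SD on the [0, 1] scale
  # How many measurements, and when (random months within the observation window)
  n_meas <- max(2, round(rnorm(1, n_meas_mean, n_meas_sd)))
  t_ij   <- sort(sample(0:t_obs, n_meas))

  # Baseline level: cases may start slightly higher (intercept difference)
  intercept <- 0.4 + if (is_case) delta_b else 0

  # Only "responding" cases change before diagnosis; everyone else stays flat
  responds    <- is_case && runif(1) < p_resp
  breakpoint  <- event_time - t_m
  indiv_slope <- if (responds) rnorm(1, slope, slope_sd) else 0

  # Flat until the breakpoint, then a linear increase, plus measurement noise
  y <- intercept + indiv_slope * pmax(0, t_ij - breakpoint) + rnorm(n_meas, 0, sigma_e)
  data.frame(month = t_ij, value = pmin(pmax(y, 0), 1))   # clamp to the [0, 1] scale
}

# Example use with the cohort from the previous sketch:
# traj <- Map(simulate_biomarker, cohort$time, cohort$case)
```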
The Contenders: Joint Models vs. Cox Models
To see how different data quality characteristics affected predictive power, we pitted our joint modeling approach against the standard Cox model. We used a nifty tool called the time-varying concordance index (C-index) to score them. This helps us see how well the models predict risk over different time periods. The Cox model, in our comparison, used the most recent known biomarker value. The joint model, on the other hand, gets to look at the whole trajectory of available information, which means it can handle data gaps without needing fancy imputation methods.
Joint models are pretty cool because they have two parts: a longitudinal submodel (for the biomarker over time) and a survival submodel (for the time-to-event). We assumed the longitudinal outcomes followed a linear path and were normally distributed. The survival part looked at the relative risk. These two parts are then “jointly” optimized, usually with some clever statistical techniques like the EM algorithm or Bayesian MCMC methods. We used the R package JMbayes2 for this heavy lifting.
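In JMbayes2 you fit the two submodels separately and then hand them to jm(). Here’s a minimal sketch, assuming the simulated data have been gathered into a long-format longitudinal table (long_df) and a one-row-per-patient survival table (surv_df) — both names are mine, not the paper’s:

```r
library(nlme)      # linear mixed-effects submodel
library(survival)  # Cox (relative-risk) submodel
library(JMbayes2)  # joint model

# Longitudinal submodel: biomarker value over time, random intercept and slope per patient
lme_fit <- lme(value ~ month, random = ~ month | id, data = long_df)

# Survival submodel: time to diagnosis with baseline covariates (x = TRUE keeps the design matrix)
cox_fit <- coxph(Surv(time, status) ~ age + sex, data = surv_df, x = TRUE)

# Joint model: the two submodels are linked and estimated together via Bayesian MCMC
joint_fit <- jm(cox_fit, lme_fit, time_var = "month")
summary(joint_fit)
```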
The models we compared were:
- The Full Monty Joint Model: Using biomarker measurements from the past 5 years, plus covariates like sex, age, and smoking status.
- The Baseline Cox Model: Only age and sex, no EHR biomarker data. (Our control group, if you will).
- The Enhanced Cox Model: Age, sex, AND the most recent biomarker measurement from the 5-year observation period.
This setup ensured a fair fight: the Cox model uses the most recent biomarker value at the end of the observation period (tend), while the joint model uses the full trajectory leading up to that same cut-off, so neither model sees anything beyond tend.
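For completeness, the two Cox comparators could be fitted like this, assuming surv_df also carries the last observed biomarker value in a column such as last_value (again, my naming, not the paper’s):

```r
# Baseline Cox model: age and sex only, no biomarker information
cox_base <- coxph(Surv(time, status) ~ age + sex, data = surv_df, x = TRUE)

# Enhanced Cox model: adds the most recent biomarker measurement from the observation window
cox_last <- coxph(Surv(time, status) ~ age + sex + last_value, data = surv_df, x = TRUE)
```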
How We Scored Them: The Time-Varying C-Index
Evaluating risk models isn’t just about one number. We want to know how accurately they identify who’s at higher risk, and this can change depending on the timeframe you’re looking at. That’s where the time-varying C-index comes in. It’s an adaptation of the regular C-index that lets us see how model performance changes over time. Essentially, it looks at pairs of patients – one who develops the disease by a certain time and one who is still healthy at that time – and checks if the model correctly predicted a higher risk for the patient who got sick. A C-index closer to 1 means better performance. We looked at the mean time-varying C-index over the 5-year follow-up period after the last possible measurement.
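The pairing logic is easier to see in code than in words. Below is a bare-bones sketch of a time-dependent concordance check — it simply scores every usable case–control pair at a given horizon t; proper implementations (for example in the survival package or JMbayes2) handle censoring more carefully:

```r
# risk:   predicted risk score per patient (higher = more at risk)
# time:   observed time to diagnosis or censoring
# status: 1 = diagnosed, 0 = censored
tv_cindex <- function(risk, time, status, t) {
  cases    <- which(status == 1 & time <= t)   # diagnosed by time t
  controls <- which(time > t)                  # still event-free at time t
  if (length(cases) == 0 || length(controls) == 0) return(NA)

  concordant <- 0; total <- 0
  for (i in cases) {
    for (j in controls) {
      total      <- total + 1
      concordant <- concordant + (risk[i] > risk[j]) + 0.5 * (risk[i] == risk[j])
    }
  }
  concordant / total
}

# Mean time-varying C-index over the 5-year (60-month) follow-up after the last measurement,
# assuming you have risk, time, and status vectors for your cohort:
# horizons <- seq(6, 60, by = 6)
# mean(sapply(horizons, function(t) tv_cindex(risk, time, status, t)), na.rm = TRUE)
```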
The Big Reveal: What Our Simulations Told Us
Okay, so after all that simulating and comparing, what did we learn? We looked at how various parameters influenced the mean C-Index. Here are the highlights:
- Sample Size Matters (Duh!): With tiny sample sizes (like N=50), the Cox model and joint model performed pretty similarly. But once we bumped it up to N=200 or more, the joint model started to show a clearer advantage. More data generally means more robust predictions and narrower prediction intervals. So, aim for at least 200 subjects if you can.
- Measurement Frequency – More Can Be Better (Up to a Point): When we had at least one measurement per year, the joint model generally did better. However, after a certain point, just adding more and more data points didn’t really boost performance much further. The takeaway? At least one measurement per year is a good rule of thumb for joint models.
- Noise? Bring it On (for Joint Models!): This was interesting! As the ‘noise’ in the measurements increased (think less precise readings), the joint model actually showed a greater advantage over the Cox model. It seems joint models are better at filtering out this noise. If your data is pretty clean (low noise), the difference between the models might be minimal. But if things are a bit messy (noise standard deviation around or above σe = 0.075), the joint model might be your hero.
- Timing of Biomarker Changes: If biomarker changes happen only shortly before diagnosis, the joint model is better at picking this up than the Cox model (which only sees the last value). If the change happens over a longer period, the difference isn’t as stark. This one’s tricky to know beforehand in real data, though.
- Baseline Differences: If there’s already a difference in biomarker levels at the start between those who will get sick and those who won’t (an intercept difference), the joint model seems to benefit more from this than the Cox model. So, if you suspect an intercept difference greater than about Δb ≥ 0.1, the joint model could be particularly useful.
Time Matters: Response Rate and Slope Variance
We also looked at how performance changed over the follow-up time for a couple of other factors: the percentage of patients actually showing the biomarker-disease link (presp) and how much the individual biomarker slopes varied (σm).
It turns out, if only a small fraction of patients show that biomarker response, neither the joint model nor the Cox model (with biomarker) will do much better than a basic Cox model without any biomarker info. You really need a good chunk of patients (like 80% or more) to be “responders” for these models to shine. High heterogeneity (lots of variation in how patients respond) makes it tough to spot the pattern.
Regarding slope variance, if the biomarker slope is pretty similar across all sick patients (low variability), the Cox model does quite well. As heterogeneity in the slope increases, the performance of the Cox model (with biomarker) becomes more comparable to the joint model.
An interesting pattern emerged: for predictions closer to the last measurement time, the models performed more similarly. The joint model’s advantage often became more apparent when predicting further out, likely because of its ability to model the entire trajectory.
Drumroll, Please… Our Data Quality Guidelines!
The Cox model is simpler to use and explain. So, we should only really bother with the more complex joint model if it’s likely to give us a real performance boost. Based on all our simulations, I’ve put together a sort of checklist. Remember to normalize your longitudinal measurements to a [0, 1] range first!
Consider a Joint Model if:
- Sample Size (N): You have at least N ≥ 200 patients.
- Measurements per Year (nabs): You have at least 1 measurement per year on average.
- Measurement Noise (σe): The noise standard deviation is σe > 0.075 (on the [0,1] scale).
- Years of Assumed Slope (tm): The biomarker changes are expected to occur for less than 3 years before diagnosis (joint model excels here).
- Intercept Difference (Δb): There’s a baseline difference in the biomarker between cases and controls of Δb ≥ 0.1 (on the [0,1] scale).
- Response Rate (presp): You expect a high proportion (ideally ≥ 0.8) of patients who develop the disease to show the biomarker change. (This is often unknown but can be improved by focusing on specific patient subgroups).
- Slope Standard Deviation (σm): The variability in the biomarker slope among diseased patients is relatively low (e.g., σm ≤ 0.005 on the [0,1] scale per month).
Most of these can be estimated from your real-world data. The response rate is the trickiest, but careful cohort selection can help!
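If you want to make the checklist operational, a little helper like this can flag which criteria your (already [0, 1]-scaled) data meet. It’s just my own wrapper around the thresholds above, not code from the paper, and the example values at the bottom are purely hypothetical:

```r
joint_model_checklist <- function(n, meas_per_year, sigma_e, years_of_slope,
                                  delta_b, p_resp = NA, sigma_m = NA) {
  checks <- c(
    sample_size       = n >= 200,
    meas_frequency    = meas_per_year >= 1,
    measurement_noise = sigma_e > 0.075,
    slope_duration    = years_of_slope < 3,
    intercept_diff    = delta_b >= 0.1,
    response_rate     = is.na(p_resp) || p_resp >= 0.8,    # often unknown in practice
    slope_sd          = is.na(sigma_m) || sigma_m <= 0.005 # per month, on the [0, 1] scale
  )
  print(checks)
  all(checks)   # TRUE suggests a joint model is worth the extra effort
}

# Hypothetical estimates from your own (normalized) EHR data, for illustration only:
joint_model_checklist(n = 250, meas_per_year = 1.2, sigma_e = 0.09,
                      years_of_slope = 2, delta_b = 0.15, p_resp = 0.85, sigma_m = 0.004)
```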
Putting Theory into Practice: Real-World Test Cases
To see if these guidelines actually work, we tested them on two real datasets.
Example 1: Primary Biliary Cirrhosis (PBC) and Bilirubin
PBC is a slow, progressive liver disease. We used data from a Mayo Clinic trial (312 patients) looking at the drug D-penicillamine. We focused on bilirubin levels. After normalizing the data and estimating the parameters from our checklist, guess what? All the requirements in our guidelines were met! So, we expected the joint model to outperform the Cox model. And voila! As Figure 7 in the original paper shows, the joint model indeed gave us better prediction accuracy, especially for prediction intervals longer than a year. Score one for the guidelines!
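If you want to get a feel for this comparison yourself, the PBC data ship with JMbayes2 as pbc2 (long format) and pbc2.id (one row per patient). Here’s a minimal joint-model fit in the spirit of the package’s own examples — the paper’s exact specification (covariates, [0, 1] scaling, endpoint) may well differ:

```r
library(nlme); library(survival); library(JMbayes2)

# One row per patient for the survival part; event indicator = death (vs. alive/transplant)
pbc2.id$status2 <- as.numeric(pbc2.id$status != "alive")

# Longitudinal submodel: serum bilirubin over time (log scale here; the paper scales to [0, 1])
lme_bilir <- lme(log(serBilir) ~ year, random = ~ year | id, data = pbc2)

# Survival submodel with baseline covariates
cox_pbc <- coxph(Surv(years, status2) ~ age + sex, data = pbc2.id, x = TRUE)

# Joint model linking the bilirubin trajectory to survival
jm_pbc <- jm(cox_pbc, lme_bilir, time_var = "year")
summary(jm_pbc)
```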
Example 2: Chronic Kidney Disease (CKD) and eGFR from UK Biobank
Next, we looked at estimated glomerular filtration rate (eGFR) to predict CKD risk using UK Biobank data. After a bunch of preprocessing (harmonizing units, handling outliers, etc.) and matching sick patients with healthy controls, we checked our guidelines again. This time, it was a mixed bag. Some requirements were met (like sample size), but others weren’t (noise was low, not many measurements per year). So, the guidelines suggested it was unclear if the joint model would offer an advantage. And indeed, when we ran the models (Figure 8 in the original paper), there was no significant performance gain for the joint model over the Cox model. Score two for the guidelines – they helped us anticipate this!
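UK Biobank data can’t be redistributed, but the case–control matching step is generic enough to sketch. Assuming a data frame ukb with columns id, case, age, and sex (all hypothetical names), a simple 1:1 pairing on sex with nearest-age controls might look like this:

```r
match_controls <- function(df) {
  cases    <- df[df$case == 1, ]
  controls <- df[df$case == 0, ]
  matched  <- list()
  for (i in seq_len(nrow(cases))) {
    pool <- controls[controls$sex == cases$sex[i], ]      # exact match on sex
    if (nrow(pool) == 0) next
    best <- pool[which.min(abs(pool$age - cases$age[i])), ]  # nearest age
    controls <- controls[controls$id != best$id, ]         # each control is used at most once
    matched[[length(matched) + 1]] <- rbind(cases[i, ], best)
  }
  do.call(rbind, matched)
}

# matched_cohort <- match_controls(ukb)
```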
So, What’s the Bottom Line?
Identifying patients at risk as early as possible is a huge deal, and statistical models like joint models are powerful tools. With EHR data becoming more available, the potential is massive. But, as we’ve seen, data quality really matters for how well these sophisticated models perform.
My study pinpointed several critical factors:
- Joint models tend to beat traditional Cox regression when data is noisy and when you have more frequent longitudinal measurements.
- How uniformly patients progress (i.e., how many show a response and how consistent their biomarker slopes are) significantly impacts joint model performance.
- Bigger sample sizes help all models by giving more accurate, less wobbly predictions.
The key takeaway? Yes, there are definitely situations where joint models are the star player. But there are also plenty of times when they don’t offer a big enough advantage over the simpler Cox model to justify the extra complexity. That’s why I developed that checklist – to help you figure out if your data has the right stuff for a joint model to truly shine.
Our real-world examples with PBC and CKD showed that these guidelines can actually predict whether you’ll see a benefit. When the data characteristics lined up with the guidelines, the joint model did better. When they didn’t, it didn’t. Simple as that (well, almost!).
Looking Ahead: What’s Next on the Horizon?
This research opens up some exciting avenues for future exploration. For instance, we haven’t dived deep into how these different data quality parameters might interact with each other. It’s likely a complex dance between noise, sample size, and the effect you’re trying to measure!
We also kept things a bit simple by not modeling an increasing frequency of measurements over time, which can happen with some biomarkers. And due to the sheer computational effort, we only looked at a limited number of variations for each parameter. Digging deeper into individual parameters could give us even more nuanced insights.
The joint models themselves can get fancier too. We assumed a linear slope after a breakpoint, but real-world data can be more complex. There are other ways to link the longitudinal and survival parts of the model, like looking at interaction effects or time-dependent slopes. And what about when you have multiple types of events or several biomarkers to track at once? That’s where things get really interesting (and complex!).
It would also be super valuable to understand EHR data quality across different sources and healthcare systems. We used UK Biobank data, where quality was a bit of a challenge for the joint model in one case. How does this compare to other big biobanks? Expanding this kind of data characterization would make our checklist even more robust and widely applicable. Different EHR systems have different quirks, so robust preprocessing and quality control are always going to be crucial. We might even need to explore more advanced data smoothing techniques if basic quality checks aren’t enough.
We didn’t get into imputation methods for missing data much here. The Cox model used the last available measurement, and joint models can inherently handle some missingness. But imputation for longitudinal data is a whole field in itself and could be a topic for future work.
Finally, finding those nice, homogeneous subgroups of patients is key for model performance. We can use knowledge-based approaches (focusing on specific disease subtypes) or even data-driven clustering techniques to find groups of patients with similar trajectory patterns. This could really help tease out clearer signals from the noise.
The Grand Finale: Making Sense of It All
So, what have we learned on this journey through simulations, models, and real-world data? I think the big message is that while joint models are incredibly promising for squeezing every last drop of insight from our longitudinal and survival data, they’re not a magic bullet. They really shine when the conditions are right – particularly with noisy data, frequent measurements, and relatively consistent patient responses.
The guidelines I’ve developed are a starting point, a way to help you, the researchers and clinicians, make an informed decision about whether the extra effort of a joint model is likely to pay off for your specific dataset. By carefully considering the characteristics of your EHR data, you can choose the right tools for the job, ultimately leading to more reliable analyses and, hopefully, better health outcomes.
This work is all about empowering you to tackle those EHR data quality issues head-on and make the most of these powerful modeling techniques. It’s an exciting time to be working with healthcare data, and with the right approach, we can unlock even more of its potential!
Source: Springer