AI vs. Radiologist: 14 Years Tracking Your Spine’s Health
Hey there! Let’s talk about something pretty common but often a real pain in the… well, back. Lumbar disc degeneration (DD). It’s a big deal for folks dealing with low back pain, which, let’s be honest, is a *lot* of people worldwide. For ages, when we wanted to get a good look at what’s going on with those discs, we’d grab an MRI scan. And the go-to method for grading how worn out those discs are? The Pfirrmann classification. It’s neat, simple, and radiologists have been using it forever.
But guess what? The world of medicine is getting a serious tech upgrade. Artificial Intelligence (AI) is stepping into the ring, promising to analyze images faster and maybe even more objectively than we humans can. One of these AI contenders is called SpineNet. It’s designed to look at those same MRI scans and assign Pfirrmann grades automatically. Pretty cool, right? It has been validated before, showing decent agreement with radiologists, but here’s the kicker: most of that validation has been cross-sectional, single snapshots in time. What about tracking changes over the *long* haul? That’s where things get interesting.
The Long Haul: A 14-Year Look
So, a bunch of smart folks decided to put SpineNet to the test in a longitudinal study. That means they followed the same group of people over a significant period. In this case, it was 19 male volunteers, tracked over a whopping 14 years! They had MRI scans done when they were around 37 years old (that’s the baseline) and then again when they hit 51 (the follow-up).
The goal was simple but crucial: compare how SpineNet graded the disc degeneration using the Pfirrmann classification against how two experienced radiologists graded the *exact same* scans. They looked at individual discs and also calculated a Pfirrmann Summary Score (PSS) by adding up the grades for all the lumbar discs. This PSS gives a nice overall picture of degeneration in the lower back.
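To make the PSS idea concrete, here’s a minimal sketch of how such a summary score could be computed. The disc labels, function name, and example grades are my own illustration, not taken from the study; the only thing borrowed from the article is the idea that the PSS is simply the sum of the Pfirrmann grades (1–5) across the lumbar discs.

```python
# Hypothetical illustration: summing per-disc Pfirrmann grades (1-5)
# into a Pfirrmann Summary Score (PSS). Disc labels and example
# grades are invented for demonstration purposes only.

LUMBAR_DISCS = ["L1/L2", "L2/L3", "L3/L4", "L4/L5", "L5/S1"]

def pfirrmann_summary_score(grades: dict[str, int]) -> int:
    """Sum the Pfirrmann grades of the five lumbar discs.

    With five discs graded 1-5, the PSS ranges from 5 (all normal)
    to 25 (all severely degenerated).
    """
    missing = [d for d in LUMBAR_DISCS if d not in grades]
    if missing:
        raise ValueError(f"Missing grades for discs: {missing}")
    return sum(grades[d] for d in LUMBAR_DISCS)

# Example: one fictional volunteer at baseline and follow-up.
baseline  = {"L1/L2": 2, "L2/L3": 2, "L3/L4": 3, "L4/L5": 3, "L5/S1": 4}
follow_up = {"L1/L2": 2, "L2/L3": 3, "L3/L4": 3, "L4/L5": 4, "L5/S1": 5}

print(pfirrmann_summary_score(baseline))   # 14
print(pfirrmann_summary_score(follow_up))  # 17
```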
Human vs. Machine: The Grading Showdown
Now, you might expect the AI and the humans to be perfectly in sync, right? Well, not quite. The study found that there were some “notable discrepancies” between SpineNet’s grading and the radiologists’.
For starters, SpineNet seemed to be a bit more enthusiastic about assigning Pfirrmann grade 1 (which is basically a normal, healthy disc with no degeneration) to several discs, especially in the upper lumbar spine (L1/2 to L3/4). The radiologists, on the other hand, didn’t see *any* grade 1 discs in this age group. In fact, in some cases, discs the radiologists thought were a moderate grade 3 were called grade 1 by SpineNet! That’s a pretty big difference.
On the flip side, SpineNet also assigned grade 5 (severe degeneration) to slightly *more* discs than the radiologists did, particularly at the lower levels (L4/L5 and L5/S1). There was even one instance where a disc graded as 3 by radiologists was called a 5 by SpineNet. This tendency for the AI to sometimes *underestimate* (grade 1) and sometimes *overestimate* (grade 5) the severity compared to the human experts is a key takeaway.
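One simple way to see this kind of under- and over-grading pattern is to cross-tabulate the two sets of grades. The sketch below uses invented per-disc grades (not the study’s data) purely to show how a disagreement like “radiologist says 3, SpineNet says 1” shows up as an off-diagonal cell.

```python
# Hypothetical illustration: cross-tabulating per-disc grades from a
# radiologist against SpineNet to see where the two disagree.
# The grade lists are invented; they only mimic the kind of pattern
# the article describes (the AI drifting toward grades 1 and 5).
import pandas as pd

radiologist = [2, 3, 3, 4, 4, 2, 3, 4, 5, 3]
spinenet    = [1, 1, 3, 4, 5, 2, 3, 5, 5, 2]

table = pd.crosstab(
    pd.Series(radiologist, name="Radiologist"),
    pd.Series(spinenet, name="SpineNet"),
)
print(table)
# Off-diagonal cells (e.g. radiologist 3 vs SpineNet 1) are exactly
# the kind of discrepancy highlighted in the study.
```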

When they crunched the numbers on the Pfirrmann Summary Scores (PSS), the agreement between the first radiologist and SpineNet was rated as “poor”. The average PSS assigned by the radiologist was consistently higher than SpineNet’s at both baseline and follow-up.
However, when they looked at the agreement *between the two radiologists*, it was much better – ranging from “substantial” to “almost perfect”. This suggests the human experts were largely on the same page, while the AI had its own distinct way of seeing things.
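The article doesn’t say exactly which agreement statistic was behind labels like “poor” or “almost perfect”, but for ordinal grades such as Pfirrmann’s, a weighted Cohen’s kappa is a common choice. Purely as a sketch of how rater agreement might be quantified, with invented grade lists:

```python
# Sketch of quantifying rater agreement with a weighted Cohen's kappa.
# This is a common choice for ordinal grades like Pfirrmann's; the
# article itself doesn't state which statistic was used, and the
# grade lists here are invented for illustration only.
from sklearn.metrics import cohen_kappa_score

radiologist_1 = [2, 3, 3, 4, 4, 2, 3, 4, 5, 3]
radiologist_2 = [2, 3, 4, 4, 4, 2, 3, 4, 5, 3]
spinenet      = [1, 1, 3, 4, 5, 2, 3, 5, 5, 2]

# Quadratic weighting penalises large disagreements (e.g. 1 vs 3)
# more heavily than adjacent-grade disagreements (e.g. 3 vs 4).
print(cohen_kappa_score(radiologist_1, radiologist_2, weights="quadratic"))
print(cohen_kappa_score(radiologist_1, spinenet, weights="quadratic"))
```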
Tracking Change Over Time
Despite the differences in *absolute* grading, both SpineNet and the radiologists did agree on one major thing: disc degeneration generally increased over the 14 years. This is pretty much what you’d expect as people age. Both methods showed a trend towards higher scores at the follow-up.
Interestingly, the study noted that SpineNet was quite consistent in its *own* grading over time, which is a plus if you’re using it specifically to track progression in longitudinal studies. It didn’t randomly jump around in its assessments.
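In other words, even when two graders disagree on absolute scores, you can still check whether they agree on the *direction* of change. Here’s a tiny sketch of that idea, using made-up baseline and follow-up PSS values rather than the study’s data:

```python
# Sketch: do two graders agree on the direction of change over time,
# even if their absolute scores differ? PSS values are invented.
import statistics

# (baseline PSS, follow-up PSS) per volunteer, for each grader
radiologist_pss = [(12, 15), (10, 13), (14, 14), (11, 16)]
spinenet_pss    = [(10, 12), (9, 12), (12, 13), (10, 14)]

def mean_change(pairs):
    """Average follow-up minus baseline PSS across volunteers."""
    return statistics.mean(fu - bl for bl, fu in pairs)

print(mean_change(radiologist_pss))  # positive => degeneration increased
print(mean_change(spinenet_pss))     # also positive, so direction agrees
```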
Why the Discrepancies? And What Does It Mean?
So, why the difference between the AI and the human eye? The researchers point to a few possibilities.
* AI Training Data: SpineNet was trained on data graded by a single radiologist, and perhaps on a population with different characteristics (like mean age). The subjectivity of that initial training data could be influencing how SpineNet grades things differently from other radiologists.
* Human Subjectivity: While the two radiologists in *this* study agreed well, human interpretation *can* vary. AI aims for standardization, but it’s only as good as the data it learns from.
* MRI Scanner Differences: The study used different MRI machines at baseline (1.0 T) and follow-up (1.5 T). While SpineNet is designed to handle different scanners, it’s a potential factor.
The study also highlighted a crucial point: while AI like SpineNet is great at analyzing the *structural* changes it was trained on (like disc height, signal intensity, etc.), radiologists can spot a whole lot more. They can see things like infections, fractures, tumors, or other conditions that might be causing back pain but aren’t related to standard disc degeneration grading. These are critical findings that AI, in its current form, might miss.

Limitations and the Future
The researchers were upfront about a major limitation: the small sample size (only 19 participants completed both MRIs). This was due to participants dropping out over the 14 years. While they still saw clear differences, a larger study would provide more robust data.
Despite the discrepancies and limitations, the takeaway isn’t that AI is bad. Far from it! The study concludes that AI systems like SpineNet have “promise as complementary tools” in radiology, especially for consistent tracking in longitudinal studies. However, it strongly “emphasizes the need for ongoing refinement of AI algorithms.”
Think of it like this: SpineNet is a super-powered assistant. It can quickly process images and give you a grading based on what it’s learned. This could potentially speed things up and provide a consistent measure over time. But it doesn’t replace the experienced human radiologist who can use that information, look at the bigger picture, spot unexpected issues, and integrate it all with the patient’s clinical story.

So, while AI is making incredible strides in medical imaging, this 14-year look at lumbar disc degeneration grading shows we’re not quite at the point where we can hand over the reins entirely. It’s a powerful tool, but it needs more tuning and should be seen as a partner to the human expert, not a replacement. The journey to perfect AI in radiology continues!
Source: Springer
