The Hidden Treasure in Pathogen Data: Estimating Deferred Value
Hey there! Let’s talk about something pretty cool and super important for keeping us all safe from nasty bugs: pathogen genomic data. You know, the genetic blueprints of viruses, bacteria, and other infectious agents. The recent COVID-19 pandemic really shone a spotlight on just how vital this information is. It showed us the power of tracking pathogens in almost real-time, but also highlighted some frustrating bumps in the road when it comes to sharing that data globally.
What we’re diving into today is something called the “deferred value” of this data. Think of it like finding a hidden treasure map. The immediate value is knowing you *have* the map. The *deferred* value is what you find when you actually follow it and dig! It’s the value that emerges *after* the initial sequencing, when the data is shared, aggregated, and re-analyzed by others for different purposes.
Why Sharing is Caring (for Global Health)
Advances in understanding the genomes of microbes have totally changed how we see infectious diseases and how they spread. The COVID-19 experience kicked the use of sequencing for disease control into high gear, and sharing that data internationally became the bedrock for watching out for future pandemics and epidemics. Genomics-based surveillance is now a key part of disease control programs.
Big initiatives like the WHO Pandemic Agreement (often called the pandemic treaty) and the International Health Regulations are pushing for more international data sharing. And honestly, the potential value you get from pooling all this international pathogen genomic data is just immense. Linking surveillance data means we get more accurate pictures, less bias, the ability to study health across populations, and even better case finding and contact tracing.
The utility of this data has grown way beyond just basic research. It’s now crucial for public health responses, modeling diseases, planning interventions, and even designing drugs and vaccines. This new value, especially from timely and high-quality data, is also essential for getting genomic diagnostics and therapeutics ready for the age of Artificial Intelligence.
But here’s the rub: a lot of the data shared in open databases is missing the crucial context – the ‘metadata’. Things like where the sample came from, when it was collected, or details about the patient. Without that, the data’s utility is seriously limited. Plus, there’s a growing mountain of sequencing data that exists as “UJAD” – “unpublished in journals and available in databases” – which is basically unseen and underutilized globally. As data explodes, focusing on quality, especially for AI and deep learning, is becoming super important.
Our Cholera Case Study
To get a handle on this deferred value, we looked at a specific example: Vibrio cholerae genomes. Cholera is a nasty disease that’s still endemic in many parts of the world, causing millions of cases each year. It’s also a target for elimination, making its genomic data particularly valuable for understanding spread and developing tools like vaccines.
We gathered data on over 10,000 V. cholerae genomes shared on international repositories between 2010 and early 2024. What we saw was fascinating:
- Exponential Growth: The amount of genomic data available has grown exponentially over that period. This reflects better access to sequencing tech, even in countries hit hard by cholera.
- Shifting Data Providers: Initially, academic institutions were the main source of data. But over time, microbiology service providers – like public health and diagnostic labs – have dramatically increased their contributions. By 2023, they were submitting over 80% of the genomes! This is a big deal, showing sequencing is moving from research labs to the front lines of public health.
- Geographic Gaps: Despite the increase, most data still comes from countries with advanced economies. Regions where cholera is endemic remain seriously under-represented in the datasets. This is a major issue because data from high-prevalence areas is often intrinsically more valuable for understanding global diversity and spread.
- Metadata Matters (and is Often Missing): We found significant variability in data quality and the availability of key metadata. Academic institutions were less likely than microbiology labs to include raw sequencing data or crucial details like the sample source, year, or country of origin.
- The Time Lag Problem: This was perhaps the most striking finding. The time between collecting a sample and submitting its sequenced genome to a database remained stubbornly high, averaging a whopping 8 years! And this hasn’t improved over time. We even found examples where the lag exceeded 50 years (likely from sequencing archived samples, which has its own value, but isn’t timely surveillance).
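To make the time lag concrete, here is a minimal sketch of how it could be computed from a collection date and a submission date. The records below are made up for illustration and are not taken from the study; the function name `lag_years` is also hypothetical. In practice, many database entries only record a collection year, so real calculations are usually coarser than this.

```python
from datetime import date
from statistics import mean, median

def lag_years(collected: date, submitted: date) -> float:
    """Years elapsed between sample collection and database submission."""
    return (submitted - collected).days / 365.25

# Hypothetical (collection date, submission date) pairs; NOT real study data.
records = [
    (date(2015, 3, 1), date(2016, 3, 1)),    # shared about a year after sampling
    (date(2010, 6, 15), date(2020, 6, 15)),  # shared about ten years later
    (date(1970, 1, 1), date(2022, 1, 1)),    # archived sample sequenced decades later
]

lags = [lag_years(collected, submitted) for collected, submitted in records]
mean_lag = mean(lags)      # pulled upward by the archived sample
median_lag = median(lags)  # less sensitive to a few very old samples

print(f"mean lag: {mean_lag:.1f} years, median lag: {median_lag:.1f} years")
```

One design note: a handful of decades-old archived samples can drag the average up a lot, so it is worth looking at the median alongside the mean when judging how timely a collection really is.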
Defining and Measuring Deferred Value
So, how do we actually *estimate* this deferred value? We proposed a framework where the value of a single shared genome depends on four things:
- Data Quality: Is the sequence accurate and complete?
- Novelty: How unique is this sequence compared to what’s already available?
- Associated Metadata: How much contextual information (source, date, location, clinical details) is included?
- Timeliness: How quickly was the data shared after the sample was collected?
By combining scores for these factors, we can get an estimate of a single genome’s potential value for secondary use. We also developed a “value index” for collections of genomes, which considers the average value of individual genomes but also penalizes datasets that lack genomic diversity (e.g., lots of super similar sequences from one outbreak). A diverse dataset is generally more representative and valuable for broader analysis.
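The text above doesn’t spell out the exact formulas, so the following is only a rough sketch of the idea, under loudly stated assumptions: each factor is scored from 0 to 1, the four scores are averaged with equal weights, and the collection-level index multiplies the average genome value by a simple diversity fraction (distinct clusters divided by number of genomes). The names `GenomeRecord`, `genome_value`, `value_index`, and the `cluster` label are all hypothetical, not the authors’ actual method.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class GenomeRecord:
    quality: float     # 0..1, sequence accuracy and completeness
    novelty: float     # 0..1, how different it is from what is already shared
    metadata: float    # 0..1, share of contextual fields that are filled in
    timeliness: float  # 0..1, higher means shared sooner after collection
    cluster: str       # hypothetical label grouping near-identical genomes

def genome_value(g: GenomeRecord) -> float:
    """Assumed scoring: equal-weight average of the four factor scores."""
    return mean([g.quality, g.novelty, g.metadata, g.timeliness])

def value_index(genomes: list[GenomeRecord]) -> float:
    """Assumed index: average genome value scaled by a diversity fraction."""
    if not genomes:
        return 0.0
    average_value = mean(genome_value(g) for g in genomes)
    # Fraction of distinct clusters: 1.0 if every genome is different,
    # close to 0 if the collection is many copies of one outbreak strain.
    diversity = len({g.cluster for g in genomes}) / len(genomes)
    return average_value * diversity

# Two made-up collections with identical per-genome scores.
outbreak = [GenomeRecord(0.8, 0.8, 0.8, 0.8, "A") for _ in range(4)]
diverse = [GenomeRecord(0.8, 0.8, 0.8, 0.8, c) for c in "ABCD"]

print(value_index(outbreak), value_index(diverse))
```

With these toy inputs the single-outbreak collection scores lower than the diverse one even though every individual genome has the same score, which is the penalizing behavior the value index is meant to capture. A real implementation would need a defensible way to derive each factor score and to measure genomic similarity.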
What the Cholera Value Index Showed
Applying our framework to the V. cholerae data revealed some interesting patterns about value:
- The aggregated value index for genomes submitted by microbiology service providers (public health/clinical labs) was significantly higher than for those from academic institutions. This makes sense: these labs often collect samples for public health purposes and may be better positioned to provide relevant metadata, even though academic institutions were the early adopters of the tech.
- While the annual value from academic submissions stayed relatively stable, the cumulative value from microbiology labs steadily increased.
- High, medium, and low-value genomes were found across different types of V. cholerae strains, not just the pandemic ones.
This suggests that the shift towards public health labs as data providers is actually increasing the *value* of the shared data over time, even if issues like geographic representation and overall timeliness persist.
The Roadblocks and the Rewards
Genomic data is a bit tricky economically. It’s a “non-rivalrous good”, meaning one person using it doesn’t stop someone else from using it. It’s also easy to copy but hard to protect, and hard to put a direct monetary value on. The cost of sequencing doesn’t necessarily equal the true value of the resulting data.
Often, the folks who generate the “raw” data (especially in lower-resource settings) might feel a bit left out when their data is used internationally without much recognition or equitable benefit. There’s a real risk of unequal reuse.
Our framework, we hope, helps estimate this deferred value and highlights the potential benefits for *both* the data providers and the users. It could help identify which genomes are most valuable to sequence and share, encourage faster sharing, and even point out redundant data. It also makes a strong case for investing in microbial genomics and surveillance to be ready for future pandemics.
The increasing role of microbiology labs is a positive sign of technology maturation, but it also brings challenges. Research data is often shared upon publication, but there’s no general expectation that health service data will be shared. This could mean a huge volume of valuable “UJAD” data remains locked away.
The persistent, long time lag between sample collection and data submission for cholera genomes is concerning, especially compared to the dramatic reduction seen during COVID-19 (from a median of 85 days in 2020 to 19 days in 2021 for SARS-CoV-2 data). We need to encourage uploaders to provide the sequencing date as well, so we can tell how much of the delay happens before sequencing and how much after.
Sharing high-value sequences promptly is key. We found an example of a crucial drug-resistant V. cholerae strain that wasn’t shared until a year after sampling and wasn’t published until a year after that. How do we incentivize faster sharing? Maybe through evidence-based arguments to funders about the benefits of open data, or perhaps a data marketplace model that offers some form of compensation or recognition.
The under-representation of data from endemic areas is another major hurdle. Just like with COVID-19, where most SARS-CoV-2 data came from high-income countries, this creates significant gaps and biases in global datasets. Data from low-income, high-prevalence countries is incredibly valuable because it helps fill these gaps and provides crucial diversity.
Even sequencing old, archived samples can add value by providing historical context and diversity not found in recent data, as seems common with V. cholerae. This all reinforces the need for equitable and timely sharing with good metadata.
Moving Forward
To make shared data truly useful, we need better metadata standards. Using things like the Genomic Epidemiology Ontology (GenEpiO) and other interoperability standards is crucial. Metadata should capture potential applications – public health surveillance, drug resistance tracking, AI training, etc.
The current approach of accepting data with minimal metadata to lower barriers might actually be counterproductive in the long run, as it immediately reduces the data’s value and puts a huge burden on secondary users to figure things out. We need systems that respect different legal and cultural data protection norms but still facilitate value creation and, importantly, reward those who create that value by sharing.
Now, our approach isn’t perfect. We focused mainly on data accuracy and completeness, not things like credibility. Our value scores are a basic estimate of non-monetary value, focused on reusability, and don’t capture the immediate value for clinical care or local surveillance. The value can also vary greatly depending on who is using the data and what they’re using it for. And our cholera dataset, while representative of global trends, might have a slight bias towards higher-quality submissions.
But we believe this framework provides a valuable starting point for estimating the “day-after-tomorrow” value of pathogen genomic data – the value that will drive data mining, meta-analyses, collective intelligence, and machine learning in the future.
With the constant threat of new epidemics and the increasing power of AI, making the case for promoting and using the deferred value of microbial genomic data is more compelling than ever. We absolutely need to address the inequalities in data production and analysis, recognize the collective value created by sharing, and support it properly. The current disconnect between the growing volume of public data and the lack of good metadata is a major barrier to unlocking this hidden treasure trove.
Our deferred value framework is designed to encourage prompt sharing by highlighting the increased value when sequences are accompanied by appropriate metadata and submitted quickly after collection. Assessing this value should help pave the way for truly international mobilization of quality microbial genomic data for global health and knowledge discovery, where shared value is acknowledged and rewarded.
Source: Springer