Cracking the Viral Code: AI Predicts Future Variants

You know, viruses are constantly changing the rules on us. It’s like they’re playing a never-ending game of evolutionary hide-and-seek, and we’re always trying to catch up. The COVID-19 pandemic really hammered this home, with SARS-CoV-2 throwing new variants at us left and right, each one seemingly better at spreading than the last. How effectively a variant spreads through a population is what we call viral fitness.

Understanding Viral Fitness

Think of viral fitness as a virus’s superpower for reproduction. It’s not just about how well it replicates inside a single cell, but how effectively it spreads through a whole population. For SARS-CoV-2, this often boils down to changes in its Spike (S) protein. This protein is the key the virus uses to unlock our cells (by binding to the ACE2 receptor), and it’s also the main target for our immune system’s defenses, like neutralizing antibodies. So, mutations in the S protein that help the virus bind better or, crucially, dodge those antibodies, can give it a serious fitness boost.

In the scientific world, we often measure this fitness using something called the relative effective reproduction number (Re). It’s basically a way to express how fast one variant is spreading relative to another in the same population, taking into account things like vaccination and previous infections.
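To make that concrete, here is a minimal sketch of how a relative Re could be estimated from weekly sequence counts. It is an illustration, not the pipeline behind CoVFit: it assumes just two variants, assumes the log ratio of their counts grows linearly over time, and converts the daily growth advantage into a relative Re with exp(slope × generation time). The counts and the 2.1-day generation time are made-up illustrative values.

```python
import math

def log_odds_slope(days, variant_counts, reference_counts):
    """Least-squares slope of log(variant / reference) against time, per day.

    All counts must be non-zero, otherwise the log ratio is undefined.
    """
    ys = [math.log(v / r) for v, r in zip(variant_counts, reference_counts)]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, ys))
    return sxy / sxx

def relative_re(slope_per_day, generation_time_days):
    """Approximate relative Re from a daily growth advantage (assumed formula)."""
    return math.exp(slope_per_day * generation_time_days)

# Hypothetical surveillance counts: the new variant doubles every week
# while the reference variant stays flat.
days = [0, 7, 14, 21]
variant = [10, 20, 40, 80]
reference = [1000, 1000, 1000, 1000]

slope = log_odds_slope(days, variant, reference)
rel_re = relative_re(slope, generation_time_days=2.1)  # illustrative value
print(f"growth advantage per day: {slope:.4f}, relative Re: {rel_re:.2f}")
```

A relative Re above 1 means the variant is outcompeting the reference; real estimates typically handle many variants at once and account for sampling noise, which this sketch ignores.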

The Challenge of Tracking Evolution

Traditionally, figuring out which variants are gaining ground involves tracking their frequency in genome surveillance data. We watch as a new variant shows up and starts becoming more common. This works, but it takes time. You need a significant number of sequences for a new variant before you can even start estimating its fitness. It’s like waiting for enough votes to be counted before you know who’s winning.

But what if we could predict a variant’s fitness *as soon as we find it*? Imagine the advantage! We could flag potentially high-risk variants much faster, giving us more time to prepare. This got us thinking: could we use the power of modern AI to crack this code?

Introducing CoVFit: Our Protein Language Model

That’s where CoVFit comes in. We developed this tool, which is essentially a protein language model – think of it like the AI models that generate text, but trained specifically on protein sequences. We adapted a state-of-the-art model called ESM-2 and taught it to predict SARS-CoV-2 variant fitness based *only* on the sequence of its S protein.

How did we teach it? Well, we used a few tricks. First, we gave it a crash course on coronavirus S proteins by showing it sequences from lots of different coronaviruses (a process called domain adaptation). Then, we used a technique called multitask learning, training it simultaneously on two types of data:

  • Genotype-fitness data: Real-world data showing the estimated fitness (relative Re) of thousands of different S protein variants collected from various countries.
  • Functional mutation data from deep mutational scanning (DMS): Experimental data showing how individual mutations in the S protein affect the virus’s ability to escape neutralization by antibodies.

By learning from both the real-world spread data and the detailed functional effects of mutations, we hoped CoVFit could get a really nuanced understanding of what makes a variant fit.
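The core idea of multitask learning can be sketched in a few lines: one shared representation feeds two task-specific heads, and the two losses are added together so that both datasets shape the shared weights. Everything below is a toy stand-in (a one-number "encoder", scalar heads, a hypothetical `dms_weight` parameter), not the actual CoVFit architecture or training code.

```python
def mse(preds, targets):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def shared_encoder(features, shared_w):
    """Toy stand-in for the protein language model: a single weighted sum."""
    return sum(f * w for f, w in zip(features, shared_w))

def multitask_loss(shared_w, fit_head, dms_head,
                   fitness_batch, dms_batch, dms_weight=1.0):
    """Combined loss: fitness task plus (weighted) antibody-escape task.

    Each batch is a list of (features, target) pairs. Both tasks pass
    through the same shared_w, so each one influences the shared encoder.
    """
    fit_preds = [fit_head * shared_encoder(x, shared_w) for x, _ in fitness_batch]
    dms_preds = [dms_head * shared_encoder(x, shared_w) for x, _ in dms_batch]
    fit_loss = mse(fit_preds, [y for _, y in fitness_batch])
    dms_loss = mse(dms_preds, [y for _, y in dms_batch])
    return fit_loss + dms_weight * dms_loss

# Made-up numbers purely to exercise the function.
fitness_batch = [([1.0, 0.0], 2.0)]   # (features, relative Re)
dms_batch = [([2.0, 1.0], 1.0)]       # (features, escape score)

total = multitask_loss([1.0, 0.0], 2.0, 1.0, fitness_batch, dms_batch)
half = multitask_loss([1.0, 0.0], 2.0, 1.0, fitness_batch, dms_batch,
                      dms_weight=0.5)
print(total, half)
```

In a real setup an optimizer would minimize this combined loss over all parameters; the weighting between tasks is a design choice that has to be tuned.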

Putting CoVFit to the Test

We were eager to see if CoVFit could actually predict the fitness of variants it hadn’t seen before. We set up experiments where we trained the model on data up to a certain date and then asked it to predict the fitness of variants that emerged *after* that date. This is the real challenge – extrapolating to the future.
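A date-based split like this is simple to set up. The sketch below assumes each variant record carries a hypothetical `first_seen` date; the variant names, dates, and fitness values are placeholders, not real data.

```python
from datetime import date

def temporal_split(records, cutoff):
    """Train on variants first seen on or before the cutoff, test on later ones."""
    train = [r for r in records if r["first_seen"] <= cutoff]
    test = [r for r in records if r["first_seen"] > cutoff]
    return train, test

# Placeholder records for illustration only.
records = [
    {"name": "variant_A", "first_seen": date(2022, 1, 10), "relative_re": 1.00},
    {"name": "variant_B", "first_seen": date(2022, 6, 5), "relative_re": 1.15},
    {"name": "variant_C", "first_seen": date(2022, 11, 20), "relative_re": 1.30},
    {"name": "variant_D", "first_seen": date(2023, 2, 1), "relative_re": 1.42},
]

train, test = temporal_split(records, cutoff=date(2022, 8, 31))
print([r["name"] for r in train])
print([r["name"] for r in test])
```

Splitting by date rather than at random matters here: a random split would let the model peek at close relatives of the test variants, which makes the task look easier than real forecasting is.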

And guess what? It did surprisingly well! CoVFit predicted the fitness ranking of future variants well enough to be informative, even for variants carrying a fair number of mutations (around 15 amino acid differences) relative to those in the training data. It significantly outperformed traditional statistical models and other non-deep learning methods we tested. This ability to look into the future, even a little bit, is a game-changer for surveillance.
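One common way to score a predicted ranking against an observed one is Spearman’s rank correlation. The sketch below is a plain-Python version that assumes there are no tied values; the numbers are invented for illustration, and this is one reasonable choice of metric rather than a description of the exact evaluation used for CoVFit.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(predicted, observed):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(predicted)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(predicted), ranks(observed)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Invented fitness values for four future variants.
predicted = [1.0, 1.2, 1.1, 1.5]
observed = [0.9, 1.25, 1.3, 1.6]

rho = spearman_rho(predicted, observed)
print(f"Spearman rho: {rho:.2f}")
```

A value of 1 means the predicted order matches the observed order exactly, 0 means no relationship, and -1 means the order is reversed. Here the model swaps two neighboring variants, which costs it a little.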

We also found that including that functional mutation data (DMS) was absolutely critical for this extrapolation ability. It seems knowing how mutations affect immune escape gives the model a deeper understanding of the underlying biology, which helps it predict the impact of new combinations of mutations.

However, it wasn’t perfect. CoVFit struggled a bit with variants that had *very* large evolutionary jumps (like BA.2.86, which had over 30 mutations in the S protein compared to earlier variants). This tells us there are still limits to its generalization, especially when the protein context changes dramatically.

Mapping the Fitness Landscape

Beyond just predicting fitness, CoVFit allowed us to explore the SARS-CoV-2 fitness landscape – essentially, mapping out which genetic changes are likely to lead to increased spread. By analyzing the evolutionary tree of the virus and using CoVFit to estimate fitness at different points, we identified hundreds of instances where fitness significantly increased.

We pinpointed specific mutations that seem to be major drivers of fitness gains. Unsurprisingly, many of these are located in the Receptor Binding Domain (RBD) of the S protein, especially the part that directly contacts the ACE2 receptor on our cells (the receptor-binding motif, or RBM). These mutations often help the virus escape antibodies.

Interestingly, we also saw examples of epistasis – where the effect of a mutation depends on the *other* mutations already present in the virus. For instance, the F456L mutation provided a big fitness boost in the XBB lineage but had a negative or neutral effect in other variants. CoVFit helped us see this context-specific effect, and published experimental data confirmed that F456L avoids negative impacts on ACE2 binding and protein expression only in the XBB background. This highlights how complex viral evolution can be!

Predicting the Next Move

Could we use CoVFit to predict the virus’s *next* evolutionary step? We tried an “in silico DMS” approach, computationally introducing every possible single amino acid change into a variant’s S protein and predicting the fitness gain using CoVFit.
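The in silico DMS loop itself is easy to picture in code. The sketch below enumerates every single amino acid substitution of a sequence and ranks them by predicted fitness gain. The `predict_fitness` argument is a placeholder for any sequence-to-fitness model; `toy_predict` is a deliberately silly stand-in (it just counts tryptophans) and has nothing to do with CoVFit’s real predictions. The five-residue sequence is a toy fragment, not a full S protein.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def single_mutants(sequence):
    """Yield (label, mutant_sequence) for every single substitution.

    Labels use the usual notation: wild-type residue, 1-based site, new residue.
    """
    for pos, wild in enumerate(sequence):
        for aa in AMINO_ACIDS:
            if aa != wild:
                yield f"{wild}{pos + 1}{aa}", sequence[:pos] + aa + sequence[pos + 1:]

def in_silico_dms(sequence, predict_fitness):
    """Rank all single mutants by predicted fitness gain over the parent."""
    baseline = predict_fitness(sequence)
    gains = {label: predict_fitness(mutant) - baseline
             for label, mutant in single_mutants(sequence)}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

def toy_predict(sequence):
    """Hypothetical scoring function used only to make the example runnable."""
    return float(sequence.count("W"))

ranked = in_silico_dms("MFVFL", toy_predict)
print(len(ranked), "single mutants scored")
print("top candidates:", ranked[:3])
```

With a real model plugged in, the top of this ranked list is where you would look for the sites most likely to matter next. Note that a full S protein has well over a thousand positions, so the scan means tens of thousands of model calls per parent variant.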

We tested this on the BA.2.86.1 lineage, which later gave rise to the highly successful JN.1 variant via the L455S mutation. Our simulation, using a version of CoVFit trained *without* any knowledge of JN.1, predicted that mutations at sites like 455 and 456 would provide the biggest fitness gains. And guess what happened in the real world? L455S (site 455) was the first mutation to rapidly spread in that lineage (leading to JN.1), and F456L (site 456) became common later in its descendants (like KP.2 and KP.3). This suggests CoVFit can indeed help predict which mutations are most likely to emerge and spread.

Why This Matters Now

This isn’t just academic curiosity. As the pandemic continues, intensive genome surveillance is becoming harder to maintain globally. Traditional methods that rely on accumulating lots of sequences will become slower. CoVFit, by predicting fitness directly from a sequence, offers a way to get immediate risk assessments for new variants, even when sequencing data is sparser. It could be a vital tool for efficient surveillance, helping us focus resources on the variants that matter most.

And it’s not just for SARS-CoV-2. The principles behind CoVFit could be applied to other viruses that constantly evolve, like influenza or RSV, especially in situations where surveillance data might be limited.

Keeping it Real: Limitations

Now, it’s important to be upfront – CoVFit isn’t a crystal ball. It has limitations.

  • The fitness data it’s trained on comes from real-world surveillance, which isn’t perfect and can have biases.
  • The model used to estimate relative Re assumes fitness differences stay constant over time, which isn’t always true in a changing immune landscape.
  • Since variants with really bad mutations don’t spread, CoVFit likely underestimates the negative impact of some mutations because it doesn’t see many examples of them.
  • In the very early stages of a *new* pandemic, before much sequence data exists, CoVFit wouldn’t have enough information to train effectively.

Despite these points, we believe CoVFit represents a significant step forward. It provides a powerful new way to explore the complex world of viral evolution and helps us better anticipate the moves of these ever-changing pathogens. Tools like this, built on the wealth of data gathered during the SARS-CoV-2 pandemic, are crucial for preparing for whatever viral challenges the future might hold.

Source: Springer
