Do protein language models understand evolution? Mixed evidence from ancestral sequences and ESM2
We wondered how protein language models trained on extant sequences would interpret plausible ancestral sequences. To explore this, we used ESM2 to evaluate maximum likelihood ancestral sequence reconstructions for two example gene families. We found that ESM2 often scores these ancestral sequences as more plausible than their extant descendants, and that it can distinguish crude consensus ancestral sequences from more sophisticated maximum likelihood reconstructions. However, these patterns are context- and model-dependent, suggesting that further investigation is needed to determine which evolutionary relationships are truly captured by large protein language models like ESM2.
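The pub's exact scoring code lives in the linked repo; as a rough illustration of the general idea, a masked-language-model "plausibility" score for a sequence is often computed as the average log-probability the model assigns to each true residue. The sketch below is a minimal, model-free version of that comparison using NumPy, with toy per-position log-probabilities standing in for real ESM2 output (the function name and toy alphabet are our own, not from the pub):

```python
import numpy as np

def pseudo_log_likelihood(log_probs, seq_ids):
    """Average log-probability assigned to the observed residue at each
    position, given per-position log-probabilities from a masked LM.

    log_probs: array of shape (seq_len, alphabet_size)
    seq_ids: list of residue indices, one per position
    """
    return float(np.mean([log_probs[i, aa] for i, aa in enumerate(seq_ids)]))

# Toy example: 3 positions over a 4-letter alphabet. In the real setting,
# these log-probabilities would come from ESM2's masked-token predictions.
log_probs = np.log(np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]))

seq_a = [0, 1, 2]  # matches the model's preferred residue at every position
seq_b = [3, 3, 3]  # mismatches everywhere

# A higher (less negative) score means the model finds the sequence more plausible.
assert pseudo_log_likelihood(log_probs, seq_a) > pseudo_log_likelihood(log_probs, seq_b)
```

Comparing such scores between a reconstructed ancestor and its extant descendants is one simple way to ask whether the model "prefers" ancestral sequences.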
We're sharing these results to encourage further exploration of how large foundation models interpret the evolutionary relationships embedded in their training data. We think this could be useful to researchers interested in interrogating the implicit learning of large protein language models like ESM2, and we propose that ancestral sequences offer a useful tool for this purpose.
The full pub is available here.
The source code to generate it is available in this GitHub repo (DOI: 10.5281/zenodo.16620544).
In the future, we hope to host notebook pubs directly on our publishing platform. Until that’s possible, we’ll create stubs like this with key metadata like the DOI, author roles, citation information, and an external link to the pub itself.