Probability and Statistics Seminar
Monday 16 June 2025 at 13:45 - UM - Building 09 - Room 109 (1st floor)
Thomas Minotto (IMAG - Université de Montpellier)
Exploring homology detection via k-means clustering of proteins embedded with a large language model
Inferring protein homology from sequence information is essential for understanding species evolution and enabling functional annotation transfer. Alongside similarity-based methods, several machine learning approaches have been developed, using various ways of representing protein data. Here, we represent proteins with a biologically oriented large language model and apply k-means clustering to the embedded data to extract homology relationships. Although our approach lacks the sensitivity of other tools, we obtain better accuracy for the detection of n:m orthologs. Furthermore, we successfully reconstruct full orthologous groups from scratch, highlighting the growing potential of combining large language models with clustering algorithms for the analysis of protein data. The presentation is not meant to be overly technical and will focus on the biological aspects of the problem.
Joint work with Thomas D. Otto (University of Glasgow) and Antoine Claessens (LPHI, University of Montpellier).
Seminar in Room 109, also streamed on Zoom: https://umontpellier-fr.zoom.us/j/7156708132