Background: The advent of protein embeddings has revolutionized bioinformatics by providing contextual representations that capture functional and evolutionary patterns. Alongside sequence alignments, they have become a cornerstone of bioinformatics. Embeddings cannot replace alignments, but they can greatly improve alignment quality. While embedding-based improvements have been considered for global alignments, local alignments, the more important counterpart, have not been studied thoroughly. Our goal is to identify the most accurate local alignment algorithm for protein sequences.
Results: We introduce a new scoring function, based on Ankh embeddings, into our previous E-score algorithm. We show that the resulting algorithm produces the most accurate local alignments of protein sequences, using a new comprehensive framework that enables thorough evaluation of local alignment quality. We design a new algorithm for local alignment extraction, localization and quality evaluation, and employ five distance metrics to measure similarity to the true alignment. We also build multiple datasets, using both natural and inserted sequences, from the Conserved Domain Database, BAliBASE, and GPCRdb. We perform over two and a half million tests to compare the new algorithm with the best BLOSUM matrices, specialized GPCRtm matrices, and top programs, such as PEbA, DEDAL, vcMSA and pLM-BLAST. Our testing also reveals interesting insights into the behaviour of various protein language models: some of them perform much better on natural sequences than on artificial ones obtained by inserting domains into random protein sequences. Also, while some models combine to produce better results, Ankh does not combine well with other embeddings.
Conclusions: The new Ankh-score-based program is clearly superior to all existing methods that we tested. The new insights into protein embeddings can guide future improvements. To facilitate their use, the new method and protocol are freely available as a web server at e-score.csd.uwo.ca and as source code at github.com/lucian-ilie/E-score.
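
Illustration: The idea of scoring a local alignment with embeddings instead of a substitution matrix can be sketched as follows. This is a minimal sketch under stated assumptions, not the E-score implementation: it assumes the score for a pair of residues is the cosine similarity of their per-residue embedding vectors, uses a linear gap penalty, and substitutes toy 2-dimensional vectors for real Ankh embeddings. The names `cosine`, `local_align` and the `gap` parameter are hypothetical and chosen only for this example.

```python
import math

STOP, DIAG, UP, LEFT = 0, 1, 2, 3  # traceback pointer codes


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def local_align(emb_a, emb_b, gap=-1.0):
    """Smith-Waterman local alignment with a linear gap penalty.

    emb_a, emb_b: lists of per-residue embedding vectors (assumed inputs).
    Returns (best_score, aligned_pairs), where aligned_pairs holds
    0-based (index_in_a, index_in_b) tuples of matched residues.
    """
    n, m = len(emb_a), len(emb_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[STOP] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = cosine(emb_a[i - 1], emb_b[j - 1])
            candidates = [
                (0.0, STOP),                  # start a new local alignment
                (H[i - 1][j - 1] + s, DIAG),  # align residue i with residue j
                (H[i - 1][j] + gap, UP),      # gap in sequence b
                (H[i][j - 1] + gap, LEFT),    # gap in sequence a
            ]
            H[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a STOP pointer is reached.
    pairs = []
    i, j = best_pos
    while ptr[i][j] != STOP:
        if ptr[i][j] == DIAG:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif ptr[i][j] == UP:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy example: the three "residues" of a match positions 1..3 of b.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, -1.0]]
score, pairs = local_align(a, b)
print(score, pairs)
```

In this toy case the aligned region of `b` is flanked by vectors with negative similarity, so the local alignment stops at its boundaries. With real embeddings, raw cosine similarities may be mostly positive, so some rescaling or shifting of the scores would likely be needed for the alignment to stay local; that choice is outside the scope of this sketch.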