Gene function prediction using an AnnoTree-based genomic language model

Abstract

Large tree-of-life (ToL) scale databases of microbial genomes are powerful resources for exploring genome structure and function in a phylogenomic context. Despite the growing availability of genomic data, large-scale genome annotation remains a challenge, and a considerable fraction of genes is still unannotated. Here, to expand the capabilities of database-wide gene function prediction, we used our AnnoTree platform as a corpus for training a Word2vec-based genomic language model (gLM). Machine learning of genomic grammar patterns across the AnnoTree database revealed functional associations between genes and enabled the inference of function for hypothetical proteins and domains, as we demonstrated by predicting novel type VI secretion proteins. Finally, we implemented a web server that allows users to interact with the AnnoTree Word2vec model, thus facilitating gene function prediction. Ultimately, our work highlighted the GTDB/AnnoTree database as a powerful training database for gLMs focused on prediction and discovery of microbial gene functions.
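
The idea of treating genomes as sentences and gene annotations as words can be illustrated with a small sketch. It is not the authors' pipeline and does not train Word2vec: as a simplified stand-in, it counts which annotation tokens co-occur within a fixed window along each genome and ranks tokens by cosine similarity of those count vectors. The toy genomes, the token names (`famA`, `hyp1`, and so on), the window size, and the helper functions are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy "genomes": each is an ordered list of gene annotation tokens.
# Every identifier here is invented for illustration only.
genomes = [
    ["famA", "famB", "hyp1", "famC", "famD"],
    ["famA", "famB", "famX", "famC", "famE"],
    ["famF", "famG", "famH", "famI"],
]


def context_vectors(sentences, window=2):
    """For each token, count the tokens found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, token in enumerate(sentence):
            start = max(0, i - window)
            stop = min(len(sentence), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[token][sentence[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(token, vectors, top_n=3):
    """Rank all other tokens by similarity of genomic context to `token`."""
    target = vectors.get(token, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != token
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


vectors = context_vectors(genomes)
neighbours = nearest("hyp1", vectors)
for name, score in neighbours:
    print(f"{name}\t{score:.3f}")
```

In this toy data the unannotated token `hyp1` ranks closest to `famX`, because the two occupy the same neighbourhood between `famB` and `famC`. A Word2vec model learns dense embeddings rather than raw counts, but the same intuition applies: tokens that share genomic context end up close together, which is what allows a function to be suggested for a hypothetical protein.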

Authors

Tan H; Moreno-Hagelsieb G; Doxey AC

Volume

00

Pagination

pp. 5120-5126

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publication Date

December 6, 2024

DOI

10.1109/bibm62325.2024.10822432

Name of conference

2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)