Statistical detection of co-occurring genes across genomes, known as ‘phylogenetic profiling’, is a powerful bioinformatic technique for inferring gene–gene functional associations. However, this can be a challenging task given the size and complexity of phylogenomic databases, difficulty in accounting for phylogenetic structure, inconsistencies in genome annotation and substantial computational requirements.
We introduce PhyloCorrelate—a computational framework for gene co-occurrence analysis across large phylogenomic datasets. PhyloCorrelate implements a variety of co-occurrence metrics including standard correlation metrics and model-based metrics that account for phylogenetic history. By combining multiple metrics, we developed an optimized score that exhibits a superior ability to link genes with overlapping GO terms and KEGG pathways, enabling gene function prediction. Using genomic and functional annotation data from the Genome Taxonomy Database and AnnoTree, we performed all-by-all comparisons of gene occurrence profiles across the bacterial tree of life, totaling 154 217 052 comparisons for 28 315 genes across 27 372 bacterial genomes. All predictions are available in an online database, which instantaneously returns the top correlated genes for any PFAM, TIGRFAM or KEGG query. In total, PhyloCorrelate detected 29 762 high confidence associations between bacterial gene/protein pairs, and generated functional predictions for 834 DUFs and proteins of unknown function.
PhyloCorrelate is available as a web-server at phylocorrelate.uwaterloo.ca as well as an R package for analysis of custom datasets. We anticipate that PhyloCorrelate will be broadly useful as a tool for predicting function and interactions for gene families.
Supplementary data are available at Bioinformatics online.