Integrated gene and species phylogenies from unaligned whole genome protein sequences
Journal Article

abstract

  • Motivation: Most molecular phylogenies are based on sequence alignments. Consequently, they fail to account for modes of sequence evolution that involve frequent insertions or deletions. Here we present a method for generating accurate gene and species phylogenies from whole genome sequence that makes use of short character string matches not placed within explicit alignments. In this work, the singular value decomposition of a sparse tetrapeptide frequency matrix is used to represent the proteins of organisms uniquely and precisely as vectors in a high-dimensional space. Vectors of this kind can be used to calculate pairwise distance values based on the angle separating the vectors, and the resulting distance values can be used to generate phylogenetic trees. Protein trees so derived can be examined directly for homologous sequences. Alternatively, vectors defining each of the proteins within an organism can be summed to provide a vector representation of the organism, which is then used to generate species trees.
Results: Using a large mitochondrial genome dataset, we have produced species trees that are largely in agreement with previously published trees based on the analysis of identical datasets using different methods. These trees also agree well with currently accepted phylogenetic theory. In principle, our method could be used to compare much larger bacterial or nuclear genomes in full molecular detail, ultimately allowing accurate gene and species relationships to be derived from a comprehensive comparison of complete genomes. In contrast to phylogenetic methods based on alignments, sequences that evolve by relative insertion or deletion would tend to remain recognizably similar.
Availability: Both the program used to convert properly formatted sequence files into sparse n-gram matrices (aacode3) and the program used to generate PHYLIP compatible pairwise distance matrices from the Singular Value Decomposition (SVD) output (cosdist) are available at http://mama.indstate.edu/user/stuart. The SVD package is available at http://www.netlib.org/svdpack/index.html, and the PHYLIP package is available at http://evolution.genetics.washington.edu/phylip.html.
Contact: G-Stuart@indstate.edu (corresponding author)
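
The pipeline described in the abstract (tetrapeptide counts, SVD, angle-based distances, summed protein vectors, PHYLIP-style output) can be illustrated with a minimal Python sketch. This is not the authors' aacode3 or cosdist code. All function names are hypothetical, a dense NumPy SVD stands in for the sparse SVDPACK routines, `1 - cos(angle)` is an assumed form of the angle-based distance, the rank `k` is arbitrary, and the toy sequences and species labels are invented for illustration.

```python
from collections import Counter

import numpy as np


def tetrapeptide_matrix(proteins, n=4):
    """Count overlapping n-grams; rows are n-grams, columns are proteins."""
    counts = [Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
              for seq in proteins]
    vocab = sorted(set().union(*counts))
    row = {gram: r for r, gram in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(proteins)))
    for col, counter in enumerate(counts):
        for gram, count in counter.items():
            matrix[row[gram], col] = count
    return matrix, vocab


def protein_vectors(matrix, k):
    """Rank-k SVD coordinates of each protein (one row per protein)."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    return vt[:k].T * s[:k]


def cosine_distances(vectors):
    """Pairwise 1 - cos(angle); an assumed choice of angle-based distance."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero vectors
    unit = vectors / norms
    dist = 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)
    np.fill_diagonal(dist, 0.0)
    return dist


def species_vectors(vectors, owner):
    """Sum protein vectors per organism; owner[i] names protein i's organism."""
    names = sorted(set(owner))
    summed = np.array([
        vectors[[i for i, o in enumerate(owner) if o == name]].sum(axis=0)
        for name in names
    ])
    return names, summed


def phylip_matrix(names, dist):
    """Square distance matrix text: taxon count, then 10-character names."""
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, dist):
        lines.append(f"{name[:10]:<10} " + " ".join(f"{x:.6f}" for x in row))
    return "\n".join(lines)


# Toy input (invented): two identical proteins, one variant, one unrelated.
proteins = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYLAKQR", "GGSWWPLLNN"]
owner = ["sp1", "sp1", "sp2", "sp3"]

matrix, vocab = tetrapeptide_matrix(proteins)
vectors = protein_vectors(matrix, k=3)
protein_dist = cosine_distances(vectors)

names, summed = species_vectors(vectors, owner)
species_dist = cosine_distances(summed)
text = phylip_matrix(names, species_dist)
print(text)
```

The printed matrix is laid out in the square distance format that PHYLIP's distance-based tree programs read. For genome-scale data, a sparse matrix and a truncated sparse SVD would be needed in place of the dense arrays used here.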

publication date

  • January 1, 2002