Conference

Evolution of N-Gram Frequencies Under Duplication and Substitution Mutations

Abstract

Genomic mutations are the driving force behind the generation of biological sequences, shaping them throughout their evolutionary history. An understanding of the statistical properties that result from mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them is challenging. In this paper, we study two types of mutations: tandem duplication and substitution. These play a critical role in forming tandem repeat regions, which are common features of the genomes of many organisms. We provide a stochastic model and, via stochastic approximation, study the behavior of the frequencies of N-grams in the resulting sequences. Specifically, we show that N-gram frequencies converge almost surely to a set which we identify as a function of model parameters. From these frequencies, other statistics can be derived. In particular, we present a method for finding upper bounds on entropy.
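
To make the two mutation types concrete, the following is a minimal illustrative sketch, not the model analyzed in the paper. It assumes a linear sequence, a uniformly random mutation position, a fixed duplication length, and a single mixing probability; the names `mutate`, `ngram_frequencies`, `simulate`, `dup_len`, and `p_dup` are hypothetical and do not come from the paper.

```python
import random
from collections import Counter


def mutate(seq, alphabet, dup_len, p_dup, rng):
    """Apply one random mutation to the list `seq` in place."""
    if rng.random() < p_dup and len(seq) >= dup_len:
        # Tandem duplication: copy a substring of length dup_len and
        # insert the copy immediately after the original.
        i = rng.randrange(len(seq) - dup_len + 1)
        seq[i + dup_len:i + dup_len] = seq[i:i + dup_len]
    else:
        # Substitution: overwrite one position with a symbol drawn
        # uniformly from the alphabet (assumption: the draw may
        # reproduce the symbol already there).
        i = rng.randrange(len(seq))
        seq[i] = rng.choice(alphabet)


def ngram_frequencies(seq, n):
    """Return the relative frequency of each n-gram occurring in seq."""
    total = len(seq) - n + 1
    counts = Counter(tuple(seq[i:i + n]) for i in range(total))
    return {"".join(gram): c / total for gram, c in counts.items()}


def simulate(seed_seq, steps, alphabet="ACGT", dup_len=1, p_dup=0.5, seed=0):
    """Mutate seed_seq for `steps` rounds and return the final string."""
    rng = random.Random(seed)
    seq = list(seed_seq)
    for _ in range(steps):
        mutate(seq, alphabet, dup_len, p_dup, rng)
    return "".join(seq)
```

Calling `ngram_frequencies(simulate("ACGT", 10000), 2)` for increasing step counts gives a rough empirical view of how 2-gram frequencies evolve under this toy process. Any resemblance to the limit set derived in the paper depends on how closely these assumptions match the actual model.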

Authors

Lou H; Schwartz M; Hassanzadeh FF

Volume

00

Pagination

pp. 2246-2250

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Publication Date

June 17, 2018

DOI

10.1109/isit.2018.8437507

Name of conference

2018 IEEE International Symposium on Information Theory (ISIT)