Evolution of $k$-mer Frequencies and Entropy in Duplication and
Substitution Mutation Systems
Abstract
Genomic evolution can be viewed as a string-editing process driven by
mutations. An understanding of the statistical properties resulting from these
mutation processes is of value in a variety of tasks related to biological
sequence data, e.g., estimation of model parameters and compression. At the
same time, due to the complexity of these processes, designing tractable
stochastic models and analyzing them are challenging. In this paper, we study
two kinds of systems, each defined by a set of allowed mutations. The first
system allows tandem duplications and substitution mutations; the second
allows interspersed duplications. We provide stochastic models and, via
stochastic approximation, study the evolution of substring frequencies for
these two systems separately. Specifically, we show that $k$-mer frequencies
converge almost surely and determine the limit set. Furthermore, we present a
method for finding upper bounds on entropy for such systems.
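
To make the objects in the abstract concrete, the following is a minimal simulation sketch of a tandem duplication and substitution system, together with an empirical $k$-mer frequency count. It is an illustration under stated assumptions, not the model analyzed in the paper: the string is treated as linear, mutation positions are chosen uniformly at random, and the names and parameters (`simulate`, `dup_prob`, `dup_len`, the alphabet `ACGT`) are hypothetical choices made for this example.

```python
import random
from collections import Counter


def tandem_duplicate(s, length, rng):
    """Copy a uniformly chosen substring of the given length and insert
    the copy immediately after the original (a tandem duplication)."""
    i = rng.randrange(len(s) - length + 1)
    return s[:i + length] + s[i:i + length] + s[i + length:]


def substitute(s, alphabet, rng):
    """Replace a uniformly chosen position with a uniformly chosen symbol.
    Assumption: the new symbol may coincide with the old one."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(alphabet) + s[i + 1:]


def kmer_frequencies(s, k):
    """Return the empirical frequency of every k-mer occurring in s."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def simulate(seed_string, steps, dup_prob=0.9, dup_len=1,
             alphabet="ACGT", seed=0):
    """Apply `steps` random mutations: a tandem duplication with
    probability dup_prob, otherwise a substitution (illustrative choice)."""
    rng = random.Random(seed)
    s = seed_string
    for _ in range(steps):
        if rng.random() < dup_prob:
            s = tandem_duplicate(s, dup_len, rng)
        else:
            s = substitute(s, alphabet, rng)
    return s
```

Running `kmer_frequencies(simulate("ACGT", 10000), 2)` for increasing numbers of steps gives an informal view of how the $2$-mer frequencies evolve as the string grows; it does not by itself establish the almost-sure convergence claimed above.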