Practical KMP/BM Style Pattern-Matching on Indeterminate Strings
Abstract
In this paper we describe two simple, fast, space-efficient algorithms for finding all matches of an indeterminate pattern $p = p[1..m]$ in an indeterminate string $x = x[1..n]$, where both $p$ and $x$ are defined on a "small" ordered alphabet $Σ$ $-$ say, $σ= |Σ| \le 9$. Both algorithms depend on a preprocessing phase that replaces $Σ$ by an integer alphabet $Σ_I$ of size $σ_I = σ$ which (reversibly, in time linear in string length) maps both $x$ and $p$ into equivalent regular strings $y$ and $q$, respectively, on $Σ_I$, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for $σ\le 4$, thus for DNA sequences, an 8-bit representation suffices). We first describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of $q$ in $y$ (that is, of $p$ in $x$), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern $q$ along the transformed string $y$. We go on to describe a similar efficient version BM Indet of the Boyer- Moore algorithm that turns out to execute significantly faster than KMP Indet over a wide range of test cases. A noteworthy feature is that both algorithms require very little additional space: $Θ(m)$ words. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, in particular the several variants of Boyer-Moore.