Home
Scholarly Works
Simple KMP Pattern-Matching on Indeterminate...
Conference

Simple KMP Pattern-Matching on Indeterminate Strings

Abstract

In this paper we describe a simple, fast, space-efficient approach to finding all matches of an indeterminate pattern p = p[1..m] in an indeterminate string x = x[1..n], where both p and x are defined on a “small” ordered alphabet Σ — say, σ = |Σ| ≤ 9. A preprocessing phase replaces Σ by an integer alphabet ΣI of size σI = σ that (reversibly, in time linear in string length) maps both x and p into equivalent regular strings y and q, respectively, on ΣI, whose maximum (indeterminate) letter can be expressed in a 32-bit word (for σ ≤ 4, thus for DNA sequences, an 8-bit representation suffices). We then describe an efficient version KMP Indet of the venerable Knuth-Morris-Pratt algorithm to find all occurrences of q in y (that is, of p in x), but, whenever necessary, using the prefix array, rather than the border array, to control shifts of the transformed pattern q along the transformed string y. Although requiring O(m2n) time in the theoretical worst case, in cases of practical interest KMP Indet executes in O(n) time. A noteworthy feature is the very small additional space requirement: Θ(m) words in all cases. We conjecture that a similar approach may yield practical and efficient indeterminate equivalents to other well-known pattern-matching algorithms, especially Boyer-Moore and its variants.

Authors

Mhaskar N; Smyth WF

Pagination

pp. 125-133

Publication Date

January 1, 2020

Conference proceedings

Proceedings of the Prague Stringology Conference 2020 Psc 2020

Contact the Experts team