On the Complexity of Finding Approximate LCS of...

On the Complexity of Finding Approximate LCS of Multiple Strings

Abstract

Finding an Approximate Longest Common Substring (ALCS) within a given set $S=\{s_1,s_2,\ldots,s_m\}$ of $m \ge 2$ strings is a key problem in computational biology, such as identifying related mutations across multiple genetic sequences. We study several variants of ALCS problems that, given integers $k$ and $t \le m$, seek the longest string $u$ -- or the longest substring $u$ of any string in $S$ -- that lies within distance $k$ of at least one substring in $t$ distinct strings from $S$. While the general problems are NP-hard, we present efficient algorithms for restricted cases under Hamming and edit distances using the $LCP_k$ and $k$-errata tree data structures. Our methods achieve run times of $\mathcal{O}(N^2)$, $\mathcal{O}(k\ell N^2)$, and $\mathcal{O}(mN\log^k \ell)$, where $\ell$ is the length of the longest string and $N$ is the sum of the lengths of all the strings in $S$. We also establish conditional lower bounds under the Strong Exponential Time Hypothesis and extend our study to indeterminate strings.