An Overview of String Processing Applications to Data Analytics

Abstract

Data analytics may conveniently be divided into four stages: preparation, preprocessing, analysis, and post-processing. String processing algorithms are fundamental especially in the second and third of these stages, where the data is cleaned, filtered and analyzed. The applicable string methodology includes, in particular, pattern matching (for which dozens of competing algorithms exist) and algorithms that compute repetitions and other forms of regularity. These are supported by powerful data structures (suffix array, prefix table, Burrows-Wheeler Transform, Lyndon array, and many others), developed and refined over the last 50 years. In this paper we provide an overview of three central methodological areas: pattern matching; repetitions (of both adjacent and non-adjacent repeating substrings); and string covering and compression. Each of these methodologies deals with both exact and approximate matches in the data provided. We outline several current applications to data analytics, in particular in bioinformatics, information security and image analysis, all of which are therefore positioned for further extension as string methodologies continue their rapid development.
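As a small illustration of how one of the data structures named above supports exact pattern matching, the following Python sketch computes a prefix table and uses it to report all occurrences of a pattern in a text. It is a minimal sketch rather than a method taken from the paper: the function names `prefix_table` and `find_all`, and the choice of a NUL character as separator, are assumptions made for this example, and the prefix table is taken to mean the array whose entry at position i is the length of the longest substring starting at i that is also a prefix of the string.

```python
def prefix_table(s: str) -> list[int]:
    """Return pi, where pi[i] is the length of the longest substring of s
    starting at position i that is also a prefix of s (so pi[0] == len(s))."""
    n = len(s)
    pi = [0] * n
    if n == 0:
        return pi
    pi[0] = n
    left = right = 0  # s[left:right] is the rightmost segment known to match a prefix
    for i in range(1, n):
        if i < right:
            # Reuse the value already computed for the mirrored position.
            pi[i] = min(right - i, pi[i - left])
        # Extend the match by explicit letter comparisons.
        while i + pi[i] < n and s[pi[i]] == s[i + pi[i]]:
            pi[i] += 1
        if i + pi[i] > right:
            left, right = i, i + pi[i]
    return pi


def find_all(pattern: str, text: str, sep: str = "\x00") -> list[int]:
    """Return the start positions of all exact occurrences of pattern in text.
    Assumes (hypothetically) that sep occurs in neither pattern nor text."""
    if not pattern:
        return []
    m = len(pattern)
    pi = prefix_table(pattern + sep + text)
    # A prefix-table value of m at a text position marks a full occurrence.
    return [i - m - 1 for i in range(m + 1, len(pi)) if pi[i] >= m]


print(find_all("ana", "bananas"))
```

Because each letter comparison either extends the matched segment or ends the inner loop, the table is computed in time linear in the combined length of pattern and text. Approximate matching, which the paper also covers, would require a different technique.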