Matching regular expressions (regexps) is a very common workload. For example, tokenization, which consists of recognizing words or keywords in a character stream, appears in every search engine indexer. Tokenization also consumes 30% or more of most XML processors’ execution time and is the first stage of any programming language compiler. Despite the multi-core revolution, regexp scanner generators like flex have changed little in 20 years, and they do not exploit the power of recent multi-core architectures (e.g., multiple threads and wide SIMD units). This is unfortunate, especially given the pervasive importance of search engines and the fast growth of our digital universe. Indexing such data volumes demands precisely the processing power that multi-cores are designed to offer. We present an algorithm and a set of techniques that use multi-core features such as multiple threads and SIMD instructions to perform parallel regexp-based tokenization. As a proof of conce...
Daniele Paolo Scarpazza, Gregory F. Russell