There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semilattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm OptEQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing OptEQ with the stateof-the-art method SEPIA. Our experimental results show that OptEQ delivers more accurate selectivity estimations.
Hongrae Lee, Raymond T. Ng, Kyuseok Shim