Sciweavers

SIGMOD
2008
ACM

Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

14 years 11 months ago
Cost-based variable-length-gram selection for string collections to support approximate queries efficiently
Approximate queries on a collection of strings are important in many applications such as record linkage, spell checking, and Web search, where inconsistencies and errors exist in data as well as queries. Several existing algorithms use the concept of "grams," which are substrings of strings used as signatures for the strings to build index structures. A recently proposed technique, called VGRAM, improves the performance of these algorithms by using a carefully chosen dictionary of variable-length grams based on their frequencies in the string collection. Since an index structure using fixed-length grams can be viewed as a special case of VGRAM, a fundamental problem arises naturally: what is the relationship between the gram dictionary and the performance of queries? We study this problem in this paper. We propose a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performanc...
Xiaochun Yang, Bin Wang, Chen Li
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2008
Where SIGMOD
Authors Xiaochun Yang, Bin Wang, Chen Li
Comments (0)