Trie-Join: Efficient Trie-based String Similarity Joins with Edit-Distance Constraints

A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they make dynamic updates of the data sets expensive to support. To address these problems, we propose a novel framework called trie-join, which can generate results efficiently with small indexes. We index the strings in a trie and use the trie to find similar string pairs efficiently through subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be ...
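
To make the idea of subtrie pruning concrete, here is a minimal sketch in Python. It is not the trie-join algorithm from the paper, whose details the abstract does not give. It uses a simpler, well-known technique instead: each string is matched against the trie while one row of the edit-distance table is carried per trie node, and a subtrie is skipped once every entry of that row exceeds the threshold. The names `TrieNode`, `build_trie`, `search`, `similarity_self_join` and the parameter `tau` are illustrative assumptions, not identifiers from the paper.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One trie node; `word` is set only where an indexed string ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None


def build_trie(strings: List[str]) -> TrieNode:
    """Index the strings in a trie (hypothetical helper)."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search(root: TrieNode, query: str, tau: int) -> List[str]:
    """Return indexed strings within edit distance tau of query.

    row[j] holds the edit distance between the prefix spelled by the
    current trie node and query[:j].
    """
    matches: List[str] = []
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if node.word is not None and row[-1] <= tau:
            matches.append(node.word)
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match or substitution
            # Subtrie pruning: the row minimum never decreases deeper in
            # the trie, so nothing below this child can get within tau.
            if min(new_row) <= tau:
                stack.append((child, new_row))
    return matches


def similarity_self_join(strings: List[str], tau: int) -> List[Tuple[str, str]]:
    """Self-join: all pairs of distinct strings within edit distance tau."""
    distinct = sorted(set(strings))
    root = build_trie(distinct)
    pairs = set()
    for s in distinct:
        for t in search(root, s, tau):
            if s < t:  # report each unordered pair once
                pairs.add((s, t))
    return sorted(pairs)
```

For example, `similarity_self_join(["ebay", "bay", "baby", "kobe", "koby", "beagy"], 1)` returns the pairs `("baby", "bay")`, `("bay", "ebay")` and `("kobe", "koby")`. Strings that share a prefix share trie nodes, so their table rows are computed once per query, which is the kind of saving a trie index offers on short strings.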

Jiannan Wang, Guoliang Li, Jianhua Feng

Data Sets | PVLDB 2010 | Short Strings | String

Added	20 May 2011
Updated	20 May 2011
Type	Journal
Year	2010
Where	PVLDB
Authors	Jiannan Wang, Guoliang Li, Jianhua Feng
