Ed-Join: an efficient algorithm for similarity joins with edit distance constraints


There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between a pair of strings. In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves a substantial reduction in candidate size and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets unde...
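
To make the "matching q-grams" idea concrete, here is a minimal Python sketch of the existing count-filtering approach that the abstract describes as the baseline. It is not the Ed-Join algorithm and does not implement the paper's mismatch-based filters. It relies on the standard bound that two strings within edit distance tau must share at least max(|s|, |t|) - q + 1 - q*tau q-grams, since each edit operation can destroy at most q of them. All function names (qgrams, passes_count_filter, similarity_join) and the nested-loop join are assumptions chosen for illustration; a practical system would use an inverted index rather than comparing every pair.

```python
from collections import Counter


def qgrams(s, q):
    """Return the multiset of q-grams (length-q substrings) of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(s, t):
    """Levenshtein distance via the classic row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def passes_count_filter(s, t, q, tau):
    """Necessary (not sufficient) condition for edit_distance(s, t) <= tau."""
    needed = max(len(s), len(t)) - q + 1 - q * tau
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= needed


def similarity_join(strings, q, tau):
    """Naive self-join: filter candidate pairs, then verify each survivor."""
    results = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if abs(len(s) - len(t)) > tau:             # length filter
                continue
            if not passes_count_filter(s, t, q, tau):  # q-gram count filter
                continue
            if edit_distance(s, t) <= tau:             # exact verification
                results.append((s, t))
    return results


if __name__ == "__main__":
    words = ["similarity", "similarty", "dissimilar", "simillarity"]
    print(similarity_join(words, q=2, tau=1))
```

The filters only prune pairs. Every surviving candidate is still verified with the exact edit distance, so the join result is the same as brute force. The abstract's point is that a weak filter leaves many candidates to verify, which is the cost the mismatch-based filters aim to cut.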

Chuan Xiao, Wei Wang 0011, Xuemin Lin

Algorithm | Distance Lower Bounds | Edit Distance Constraint | PVLDB 2008 |

» Efficient approximate entity extraction with edit distance constraints

» Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

Post Info

Added	28 Dec 2010
Updated	28 Dec 2010
Type	Journal
Year	2008
Where	PVLDB
Authors	Chuan Xiao, Wei Wang 0011, Xuemin Lin
