Top-k Set Similarity Joins

Abstract: Similarity join is a useful primitive operation underlying many applications, such as near-duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require the user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed the top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guesswork users have to perform when the similarity threshold is not known beforehand. An algorithm, topk-join, is proposed to answer top-k similarity joins efficiently. It is based on the prefix filtering principle and employs tight upper bounds on the similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
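
To make the problem statement concrete, the following is a minimal brute-force sketch of what a top-k set similarity join returns. It is not the paper's topk-join algorithm: it uses no prefix filtering or upper bounding and simply scores every pair. Jaccard similarity is an assumption here (the abstract does not name a similarity function), and the names `jaccard` and `naive_topk_join` are hypothetical.

```python
import heapq
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union (0.0 if both empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_topk_join(records, k):
    """Return the k most similar record pairs as (similarity, i, j), best first.

    Illustrative baseline only: it scores all n*(n-1)/2 pairs, which is
    exactly the cost an efficient top-k join algorithm tries to avoid.
    Assumes Jaccard similarity, which the abstract does not specify.
    """
    sets = [frozenset(r) for r in records]
    scored = (
        (jaccard(sets[i], sets[j]), i, j)
        for i, j in combinations(range(len(sets)), 2)
    )
    # nlargest keeps only k candidates in memory while scanning all pairs.
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    records = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"x", "y"}, {"a", "b"}]
    for sim, i, j in naive_topk_join(records, 2):
        print(f"records {i} and {j}: similarity {sim:.3f}")
```

No threshold appears anywhere in the call: the caller supplies only `k`, which is the point the abstract makes about removing the need to guess a similarity threshold.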

Chuan Xiao, Wei Wang 0011, Xuemin Lin, Haichuan Shang

Database | ICDE 2009 | Set Similarity Join | Similarity Join | Similarity Threshold | Similarity Values | Top-k Similarity Join

Post Info

Added	20 Oct 2009
Updated	20 Oct 2009
Type	Conference
Year	2009
Where	ICDE
Authors	Chuan Xiao, Wei Wang 0011, Xuemin Lin, Haichuan Shang
