Efficient similarity joins for near duplicate detection

With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. Experimental results show that our proposed algorithms can achieve up to 2.6x to 5x speed-up over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Search Process, Clustering

General Terms: Algorithms, Performance
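To make the prefix filtering principle mentioned in the abstract concrete, here is a minimal sketch of a Jaccard similarity self-join that uses it. This illustrates the generic principle only, not the new filtering techniques the paper proposes. The function names (`jaccard`, `prefix_filter_join`), the choice of Jaccard similarity, and the rarest-token-first ordering are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return {(i, j): similarity} for all i < j whose Jaccard
    similarity is at least `threshold`. Empty records are skipped
    (an assumption of this sketch)."""
    # Global token order: rarest tokens first, ties broken by token value.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = {}
    for i, rec in enumerate(ordered):
        if not rec:
            continue
        # Two records can only reach the threshold if their prefixes of
        # length |x| - ceil(t * |x|) + 1 share at least one token.
        # The small epsilon guards against floating-point round-up,
        # which could otherwise make the prefix one token too short.
        prefix_len = len(rec) - math.ceil(threshold * len(rec) - 1e-9) + 1
        prefix = rec[:prefix_len]

        # Candidate generation: only records sharing a prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification: compute the exact similarity for candidates only.
        for j in candidates:
            sim = jaccard(set(rec), set(ordered[j]))
            if sim >= threshold:
                results[(j, i)] = sim

        for tok in prefix:
            index[tok].append(i)
    return results


docs = [
    ["a", "b", "c", "d"],
    ["a", "b", "c", "e"],
    ["x", "y", "z"],
    ["a", "b", "c", "d"],
]
print(sorted(prefix_filter_join(docs, 0.6)))
```

The sketch verifies every candidate with a full similarity computation, so the prefix filter only affects how many pairs are examined, not which pairs are returned. Reducing that candidate set further is the problem the abstract describes.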

Chuan Xiao, Wei Wang 0011, Xuemin Lin, Jeffrey Xu Yu

Efficient Algorithms | Internet Technology | Prefix Filtering Principle | Several Existing Algorithms | WWW 2008 |

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2008
Where	WWW
Authors	Chuan Xiao, Wei Wang 0011, Xuemin Lin, Jeffrey Xu Yu
