Achieving both high precision and high recall in near-duplicate detection

To find near-duplicate documents, fingerprint-based paradigms such as Broder's shingling and Charikar's simhash algorithms have been recognized as effective approaches and are considered the state of the art. Nevertheless, we see two aspects of these approaches that may be improved. First, a high score under these algorithms' similarity measurement implies a high probability that two documents are similar, which is not the same as a high degree of similarity between them; yet how similar two documents are is what we really need to know. Second, fingerprint paradigms must trade off hash-code length against hash-code multiplicity, which makes it hard to maintain a satisfactory recall level while improving precision. In this paper our contributions are twofold. First, we propose a framework for implementing the longest common subsequence (LCS) as a similarity measurement in reasonable computing time, which leads to both high precision and high recall. Second, we pres...
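
To make the idea of LCS as a direct similarity measurement concrete, here is a minimal sketch in Python. It is not the authors' framework, which the abstract does not describe; it is the textbook dynamic-programming LCS applied to whitespace-separated tokens. The function names, the tokenization, and the normalization by the longer document's length are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Textbook dynamic programming, O(len(a) * len(b)) time, keeping
    only two rows of the table at once.
    """
    # Iterate over the longer sequence so each row is as short as possible.
    if len(a) < len(b):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(doc_a, doc_b):
    """Similarity in [0, 1]: LCS length over the longer token count.

    Assumption: this normalization is illustrative only; the abstract
    does not say how the paper normalizes its score.
    """
    tokens_a = doc_a.split()
    tokens_b = doc_b.split()
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 1.0  # two empty documents are treated as identical
    return lcs_length(tokens_a, tokens_b) / longest


if __name__ == "__main__":
    print(lcs_similarity("the quick brown fox", "the quick red fox"))
```

Unlike a fingerprint match, the returned value reflects how much of the token sequence the two documents actually share. The quadratic cost of this naive version is the kind of expense that a practical framework would need to bring down to a reasonable computing time.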

Lian'en Huang, Lei Wang, Xiaoming Li

Charikar's Simhash Algorithms | CIKM 2008 | Information Management | Similarity Measurement | Web Pages |

Post Info

Added	12 Oct 2010
Updated	12 Oct 2010
Type	Conference
Year	2008
Where	CIKM
Authors	Lian'en Huang, Lei Wang, Xiaoming Li
