Near-duplicate detection by instance-level constrained clustering

For the task of near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both “almost-identical” documents in the data cleaning task and “relevant” documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors.
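To make the idea of instance-level constraints concrete, here is a minimal sketch, not the authors' algorithm. It assumes constraints are given as must-link and cannot-link pairs of document indices (a common formulation in constrained clustering), and it uses Jaccard similarity over word sets with single-link merging as a stand-in similarity model. The function names, the `threshold` parameter, and the sample documents are all hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def constrained_clusters(docs, must_link=(), cannot_link=(), threshold=0.6):
    """Single-link clustering that honors pairwise constraints.

    docs: list of strings.
    must_link / cannot_link: iterables of (i, j) document index pairs.
    All names and defaults here are illustrative assumptions.
    """
    tokens = [set(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def violates(i, j):
        # Merging is forbidden if any cannot-link pair spans the two clusters.
        ri, rj = find(i), find(j)
        return any({find(a), find(b)} == {ri, rj} for a, b in cannot_link)

    # Step 1: must-link pairs are merged regardless of similarity.
    for i, j in must_link:
        parent[find(i)] = find(j)

    # Step 2: merge remaining pairs from most to least similar,
    # skipping any merge that would break a cannot-link constraint.
    pairs = sorted(
        combinations(range(len(docs)), 2),
        key=lambda p: jaccard(tokens[p[0]], tokens[p[1]]),
        reverse=True,
    )
    for i, j in pairs:
        if jaccard(tokens[i], tokens[j]) < threshold:
            break
        if find(i) != find(j) and not violates(i, j):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


docs = [
    "please protect our national forests from logging",
    "please protect our national forests from logging now",
    "we strongly oppose the new mercury emission rule",
    "oppose mercury rule",
    "please protect our national forests from mining",
]
unconstrained = constrained_clusters(docs)
constrained = constrained_clusters(docs, must_link=[(2, 3)], cannot_link=[(0, 4)])
```

In this toy run, similarity alone groups documents 0, 1, and 4 and leaves 2 and 3 apart. With the constraints, documents 2 and 3 are joined despite low word overlap, and document 4 stays out of the cluster containing document 0. In a real system the constraints would be derived from signals such as document attributes and content structure rather than written by hand.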

Hui Yang, James P. Callan

Document | Near-duplicate Detection | Near-duplicated Documents | SIGIR 2006 |

Post Info

Added	14 Jun 2010
Updated	14 Jun 2010
Type	Conference
Year	2006
Where	SIGIR
Authors	Hui Yang, James P. Callan
