Sciweavers

SIGIR
2004
ACM

Constructing a text corpus for inexact duplicate detection

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.

Categories and Subject Descriptors: H.2.4 [Information Systems]: Database Management - Systems - Textual Databases; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Selection Process; H.3.m [Information Storage and Retrieval]: Miscellaneous - Test Collections

General Terms: Experimentation, Measurement, Design, Algorithms

Keywords: test collections, duplicate document detection
Jack G. Conrad, Cindy P. Schriber
Added 30 Jun 2010
Updated 30 Jun 2010
Type Conference
Year 2004
Where SIGIR
Authors Jack G. Conrad, Cindy P. Schriber