As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work is to facilitate (a) investigations into the phenomenon of near duplicates and (b) algorithmic approaches to minimizing its negative effect on search results. Harnessing the expertise of both client-users and professional searchers, we establish principled methods to generate a test collection for identifying and handling inexact duplicate documents.

Categories and Subject Descriptors
H.2.4 [Information Systems]: Database Management—Systems–Textual Databases; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Selection Process; H.3.m [Information Storage and Retrieval]: Miscellaneous—Test Collections

General Terms
Experimentation, Measurement, Design, Algorithms

Keywords
test collections, duplicate document detection
Jack G. Conrad, Cindy P. Schriber