CIKM
2003
Springer

Online duplicate document detection: signature reliability in a dynamic retrieval environment

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a ‘fingerprint’ of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. W...
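The signature idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes document frequency as the collection statistic, a hypothetical rule that keeps the k rarest terms, and SHA-1 as the hash. The names `tokenize`, `document_frequencies`, and `fingerprint`, and the sample texts, are invented for illustration only.

```python
import hashlib
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for text in collection:
        df.update(set(tokenize(text)))
    return df


def fingerprint(text, df, k=4):
    """Hash the k rarest terms of a document (hypothetical selection rule).

    Rarity is measured by document frequency; ties are broken
    alphabetically so the result is deterministic.
    """
    terms = set(tokenize(text))
    rarest = sorted(terms, key=lambda t: (df.get(t, 0), t))[:k]
    canonical = " ".join(sorted(rarest))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Two near-duplicates plus unrelated documents (all invented sample data).
doc_a = "the court approved the merger filing on monday"
doc_b = "the court approved a merger filing on monday"
collection = [doc_a, doc_b,
              "the weather on monday was a mild one",
              "a report on the weather"]

df_old = document_frequencies(collection)
same_before = fingerprint(doc_a, df_old) == fingerprint(doc_b, df_old)

# The collection changes: new documents make some terms less rare.
df_new = document_frequencies(collection + ["court merger news"] * 3)
stable = fingerprint(doc_a, df_old) == fingerprint(doc_a, df_new)

print("near-duplicates match under old statistics:", same_before)
print("fingerprint of doc_a unchanged after update:", stable)
```

Under these assumptions, a fingerprint stored at indexing time may no longer agree with one recomputed after the collection statistics shift, which is the kind of reliability question the abstract raises for dynamic collections.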
Jack G. Conrad, Xi S. Guo, Cindy P. Schriber
Added 06 Jul 2010
Updated 06 Jul 2010
Type Conference
Year 2003
Where CIKM
Authors Jack G. Conrad, Xi S. Guo, Cindy P. Schriber