Online duplicate document detection: signature reliability in a dynamic retrieval environment

As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a ‘fingerprint’ of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with new documents, or new versions of documents, arriving frequently, and other documents periodically removed. W...
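
The fingerprinting idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `document_frequencies` and `fingerprint`, the whitespace tokenization, the cutoff `k`, and the use of SHA-1 are all assumptions made for illustration. The sketch selects the `k` terms of a document that occur in the fewest documents of a reference collection and hashes them into a signature.

```python
import hashlib
from collections import Counter


def document_frequencies(collection):
    """Count, for each term, how many documents in the collection contain it."""
    df = Counter()
    for text in collection:
        # set() so a term is counted once per document (hypothetical tokenizer).
        df.update(set(text.lower().split()))
    return df


def fingerprint(text, df, k=3):
    """Hash the k rarest terms of `text`, judged by the statistics in `df`.

    Illustrative only: k, the tokenizer, and SHA-1 are assumed choices.
    """
    terms = set(text.lower().split())
    # Rarest first; ties broken alphabetically so the result is deterministic.
    rarest = sorted(terms, key=lambda t: (df.get(t, 0), t))[:k]
    return hashlib.sha1(" ".join(sorted(rarest)).encode("utf-8")).hexdigest()
```

Under these assumptions, two documents that differ only in common terms share a signature, because their rarest terms are the same. The signature depends on the collection statistics, however: if the reference collection changes enough that different terms become the rarest ones, the same document can receive a different signature, which is the reliability concern the abstract raises for dynamic collections.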

Jack G. Conrad, Xi S. Guo, Cindy P. Schriber

CIKM 2003 | Document | Online Document Collections | Training Collections |

Added	06 Jul 2010
Updated	06 Jul 2010
Type	Conference
Year	2003
Where	CIKM
Authors	Jack G. Conrad, Xi S. Guo, Cindy P. Schriber
