CPM
2000
Springer

Identifying and Filtering Near-Duplicate Documents

Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed-size “sketch” for each document. For a large collection of documents (say, hundreds of millions), the size of this sketch is on the order of a few hundred bytes per document. However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a “sample” of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set intersection problem (over sets of strings), and second, the relative size of intersections is evaluated by a process o...
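The two aspects named in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract is truncated before it describes the sampling process, so the code below assumes the commonly used min-wise hashing scheme. All function names, the shingle width `w`, the sketch length `k`, the choice of BLAKE2b as the hash, and the 0.5 threshold are illustrative assumptions, and the sketch here is far larger than the byte budgets quoted in the abstract.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text (assumed tokenization)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Exact resemblance: |S(a) & S(b)| / |S(a) | S(b)| over shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def _hash(shingle, seed):
    """Seeded 64-bit hash of one shingle (hash choice is an assumption)."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(text, w=4, k=64):
    """Fixed-size sketch: the minimum hash of the shingle set under k seeds."""
    s = shingles(text, w)
    if not s:
        return []
    return [min(_hash(x, seed) for x in s) for seed in range(k)]


def estimate_resemblance(sk_a, sk_b):
    """Fraction of sketch positions that agree; approximates the resemblance."""
    if not sk_a or not sk_b:
        return 0.0
    matches = sum(1 for x, y in zip(sk_a, sk_b) if x == y)
    return matches / len(sk_a)


def is_near_duplicate(sk_a, sk_b, threshold=0.5):
    """Threshold test only; the threshold value is an arbitrary example."""
    return estimate_resemblance(sk_a, sk_b) >= threshold
```

Under this scheme only the sketches need to be stored per document: a newly encountered document is sketched once and compared against stored sketches with `is_near_duplicate`, without revisiting the original text.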
Andrei Z. Broder
Added 02 Aug 2010
Updated 02 Aug 2010
Type Conference
Year 2000
Where CPM
Authors Andrei Z. Broder