Identifying and Filtering Near-Duplicate Documents

15 years 11 months ago


Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed-size "sketch" for each document. For a large collection of documents (say hundreds of millions) the size of this sketch is of the order of a few hundred bytes per document. However, for efficient large-scale web indexing it is not necessary to determine the actual resemblance value: it suffices to determine whether newly encountered documents are duplicates or near-duplicates of documents already indexed. In other words, it suffices to determine whether the resemblance is above a certain threshold. In this talk we show how this determination can be made using a "sample" of less than 50 bytes per document. The basic approach for computing resemblance has two aspects: first, resemblance is expressed as a set (of strings) intersection problem, and second, the relative size of intersections is evaluated by a process o...
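The two aspects named in the abstract can be illustrated with a minimal Python sketch. It is not the method of the talk (the abstract is truncated, and the sub-50-byte threshold test is not reproduced here); it only shows one common way to pose resemblance as a set intersection problem over word "shingles" and to estimate the relative intersection size by min-wise hashing. All names (`shingles`, `resemblance`, `sketch`, `estimate`), the shingle width `w=4`, the sketch length `k=64`, and the use of BLAKE2b as the hash are assumptions made for illustration.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # Short document: treat the whole text as a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Exact resemblance of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    """64-bit hash of a shingle under a given seed (illustrative choice)."""
    data = (str(seed) + "\x00" + " ".join(shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def sketch(shingle_set, k=64):
    """Fixed-size sketch: the minimum hash value under each of k seeds."""
    if not shingle_set:
        raise ValueError("cannot sketch an empty shingle set")
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(k)]


def estimate(sketch_a, sketch_b):
    """Estimate resemblance as the fraction of positions where sketches agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    sa, sb = shingles(doc_a), shingles(doc_b)
    print("exact resemblance:", resemblance(sa, sb))
    print("sketch estimate:  ", estimate(sketch(sa), sketch(sb)))
```

With only 64 hash functions the estimate is noisy; a larger `k` tightens it at the cost of a larger sketch. Deciding only whether the resemblance exceeds a threshold, as the abstract describes, needs less information than estimating its value, which is why a smaller per-document sample can suffice.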

Andrei Z. Broder
