Adaptive near-duplicate detection via similarity learning

15 years 10 months ago

In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash values that serve as document signatures, using a locality-sensitive hashing scheme, for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, a desirable property that existing methods lack.

Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content...
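
The pipeline described in the abstract (sparse k-gram vectors, a similarity function, and hash-based signatures) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the uniform default weight of 1.0 (standing in for the learned weights), the weighted form of the Jaccard coefficient, and the choice of min-hash as the locality-sensitive hashing scheme are all assumptions made for illustration only.

```python
import hashlib
import math
from collections import Counter


def kgram_vector(text, k=3, weights=None):
    """Build a sparse word k-gram vector as a dict {k-gram: weight}.

    `weights` is a hypothetical lookup standing in for learned weights;
    any k-gram missing from it gets the placeholder weight 1.0.
    """
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    counts = Counter(grams)
    weights = weights or {}
    return {g: c * weights.get(g, 1.0) for g, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_jaccard(u, v):
    """Weighted Jaccard coefficient (assumes non-negative weights)."""
    keys = set(u) | set(v)
    num = sum(min(u.get(g, 0.0), v.get(g, 0.0)) for g in keys)
    den = sum(max(u.get(g, 0.0), v.get(g, 0.0)) for g in keys)
    return num / den if den else 0.0


def minhash_signature(vec, num_hashes=16):
    """Map a vector to a short signature with min-hash.

    Assumption: min-hash is used here as one example of a
    locality-sensitive hashing scheme; it looks only at which
    k-grams are present and ignores their weights.
    """
    if not vec:
        return []
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(),
                "big",
            )
            for g in vec
        ))
    return signature


doc_a = kgram_vector("the quick brown fox jumps over the lazy dog")
doc_b = kgram_vector("the quick brown fox jumps over the lazy cat")
print(cosine(doc_a, doc_b), weighted_jaccard(doc_a, doc_b))
```

In this sketch, two documents that differ only in their final word share six of their seven 3-grams, so both similarity scores are high. The fraction of matching positions in two min-hash signatures gives a rough estimate of the unweighted Jaccard coefficient, which is what makes the signatures usable for fast candidate lookup.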

Hannaneh Hajishirzi, Wen-tau Yih, Aleksander Kolcz
