

Automatic retrieval of similar content using search engine query interface

We consider the coverage testing problem, where we are given a document and a corpus with a limited query interface and are asked to determine whether the corpus contains a near-duplicate of the document. This problem has applications in search engines for competitive coverage testing. To solve it, we propose approaches that work in three main steps: generate a query signature from the document, query the corpus using the query signature and scrape the returned results, and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that these methods perform well across different search engine corpora and documents in multiple languages. They are also robust against performance parameter variations.
Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Search Process
General Terms Algorithms, Experimentation, Measurem...
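The three steps named in the abstract (signature generation, querying, validation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' method: the signature heuristic (rarest terms within the document, longer terms first), the word-shingle Jaccard similarity, the 0.8 threshold, and the helper names `query_signature`, `has_near_duplicate`, and `make_toy_search` are all hypothetical. The `search` callable stands in for a real search engine query interface, which the sketch does not attempt to model.

```python
from __future__ import annotations

import re
from collections import Counter
from typing import Callable, Iterable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def query_signature(doc: str, k: int = 5) -> list[str]:
    """Step 1 (assumed heuristic): pick k terms, preferring terms that are
    rare inside the document, then longer terms, then alphabetical order."""
    counts = Counter(tokenize(doc))
    ranked = sorted(counts, key=lambda t: (counts[t], -len(t), t))
    return ranked[:k]


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams (shingles) of the text."""
    toks = tokenize(text)
    if len(toks) < n:
        return {tuple(toks)} if toks else set()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def has_near_duplicate(
    doc: str,
    search: Callable[[list[str]], Iterable[str]],
    k: int = 5,
    threshold: float = 0.8,
) -> bool:
    """Steps 2 and 3: query the corpus with the signature, then validate
    each returned text against the input document."""
    signature = query_signature(doc, k)
    doc_shingles = shingles(doc)
    return any(
        jaccard(doc_shingles, shingles(result)) >= threshold
        for result in search(signature)
    )


def make_toy_search(corpus: list[str]) -> Callable[[list[str]], list[str]]:
    """Hypothetical stand-in for a query interface: returns the documents
    that contain every query term (conjunctive keyword match)."""
    def search(terms: list[str]) -> list[str]:
        return [d for d in corpus if set(terms) <= set(tokenize(d))]
    return search


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "completely unrelated text about databases and indexing",
    ]
    search = make_toy_search(corpus)
    print(has_near_duplicate(corpus[0], search))
    print(has_near_duplicate("something entirely different about pasta", search))
```

A conjunctive toy search like this one misses a near-duplicate whenever a signature term happens to fall in the changed part of the document, which is one reason the choice of signature terms and the number of queries issued matter in practice.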
Added 26 May 2010
Updated 26 May 2010
Type Conference
Year 2009
Where CIKM
Authors Ali Dasdan, Paolo D'Alberto, Santanu Kolay, Chris Drome