

Detecting Near-replicas on the Web by Content and Hyperlink Analysis

Replicas and near-replicas of documents are very common on the Web. Documents may be replicated completely or partially for different reasons (versions, mirrors, etc.), or the same resource may be associated with different URLs (aliases, dynamically generated pages, etc.). While replication can make information more accessible to users, near-replicated documents can hinder the effectiveness of search engines. For example, users would be annoyed by many similar pages appearing in the result list returned for a query. We propose a method to detect similar pages, in particular replicas and near-replicas, which is based on a pair of signatures. Both signatures are low-dimensional vectors, which reduces the computational cost of comparing pairs of documents. The first signature is obtained by a random projection of the bag-of-words vector representing the page contents. The second signature, referred to as Hyperlink Map, is...
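The content signature described above can be illustrated with a minimal sketch. The abstract does not specify the projection matrix, the signature dimensionality, or the comparison measure, so everything below is an assumption made for illustration: the function names (`bag_of_words`, `content_signature`, `cosine`) are hypothetical, the projection uses hash-derived ±1 entries, the default of 32 dimensions is arbitrary, and cosine similarity stands in for whatever measure the authors use. The Hyperlink Map signature is not sketched because its definition is cut off in this listing.

```python
import hashlib
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the page text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _sign(word, dim, seed):
    """Pseudo-random +1/-1 entry of the projection matrix for (word, dim).

    Assumption: deriving the entry from a hash avoids storing a matrix
    over the whole vocabulary; the paper may build its projection differently.
    """
    digest = hashlib.sha256(f"{seed}:{dim}:{word}".encode("utf-8")).digest()
    return 1.0 if digest[0] & 1 else -1.0


def content_signature(text, dims=32, seed=0):
    """Project the bag-of-words vector onto `dims` random directions."""
    counts = bag_of_words(text)
    return [
        sum(count * _sign(word, d, seed) for word, count in counts.items())
        for d in range(dims)
    ]


def cosine(a, b):
    """Cosine similarity between two signatures (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Under these assumptions, two pages would be flagged as candidate near-replicas when the cosine similarity of their signatures exceeds a chosen threshold. Because the signature depends only on word counts, pages that differ only in word order receive identical signatures, which is one reason a second, link-based signature is useful.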
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2003
Where WWW
Authors Ernesto Di Iorio, Michelangelo Diligenti, Marco Gori, Marco Maggini, Augusto Pucci