Detecting near-duplicates for web crawling

Download infolab.stanford.edu

Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar's fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design. Categories and Subject Descriptors E.1 [Data Structures...
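
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The fingerprint width F = 64, the threshold k = 3, whitespace tokenization, unit token weights, and the BLAKE2b token hash are all assumptions made for illustration. The linear scan in `near_duplicates` is only a naive baseline that states the query, not the algorithmic technique the abstract refers to. All function names are hypothetical.

```python
import hashlib

F = 64  # fingerprint width in bits (assumed for this sketch)


def token_hash(token: str) -> int:
    """Map a token to an F-bit integer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=F // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(tokens) -> int:
    """Charikar-style fingerprint: a per-bit vote over all token hashes."""
    counts = [0] * F
    for tok in tokens:
        h = token_hash(tok)
        for i in range(F):
            # Each token votes +1 where its hash bit is 1, otherwise -1.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # a positive total sets bit i of the fingerprint
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(query: int, stored, k: int = 3):
    """Naive baseline: scan every stored fingerprint, keep those within k bits."""
    return [fp for fp in stored if hamming_distance(query, fp) <= k]
```

A crawler built along these lines would compute `simhash(page_text.split())` for each fetched page and pass the result to `near_duplicates` together with the fingerprints already stored. A linear scan costs time proportional to the number of stored fingerprints per query, which is why a dedicated search technique matters at the multi-billion page scale the abstract mentions.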

Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma

Internet Technology | Near-duplicate Web Documents | Web Crawler Increases | Web Search | WWW 2007 |

Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2007
Where	WWW
Authors	Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma

Sciweavers
