On the Evolution of Clusters of Near-Duplicate Web Pages

This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: Web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the o...

Dennis Fetterly, Mark Manasse, Marc Najork

Human Computer Interaction | Internet Technology | LAWEB 2003 | Near-duplicate Documents | Pages | Web Pages |

Added	05 Jul 2010
Updated	05 Jul 2010
Type	Conference
Year	2003
Where	LAWEB
Authors	Dennis Fetterly, Mark Manasse, Marc Najork
