WWW
2002
ACM

Parallel crawlers

In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
Keywords Web Crawler, Web Spider, Parallelization
Junghoo Cho, Hector Garcia-Molina
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2002
Where WWW