Graph-based seed selection for web-scale crawlers

One of the most important steps in web crawling is determining the starting points, or seed selection. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is a nontrivial and very important problem. Selecting proper seeds can increase the number of pages a crawler will discover, and can result in a repository with more “good” and fewer “bad” pages. We propose a graph-based framework for crawler seed selection, and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms, Design, Experimentation, Performance

Keywords: Crawler, Seed Selection, PageRank, Graph Analysis

Shuyi Zheng, Pavel Dmitriev, C. Lee Giles

CIKM 2009 | Crawler Seed Selection | Database | Seed Selection | Seed Selection Approaches

Post Info

Added	26 May 2010
Updated	26 May 2010
Type	Conference
Year	2009
Where	CIKM
Authors	Shuyi Zheng, Pavel Dmitriev, C. Lee Giles
