Sciweavers

WWW
2006
ACM

Effective web-scale crawling through website analysis

The web crawler space is often delimited into two general areas: full-web crawling and focused crawling. We present netSifter, a crawler system that integrates features from these two areas to provide an effective mechanism for web-scale crawling. netSifter applies a combination of page-level analytics and heuristics to a sample of web pages from a given website. These algorithms score individual web pages to determine the general utility of the overall website. In doing so, netSifter can formulate an in-depth opinion of a website (and the entirety of its web pages) with relatively little work. netSifter is then able to bias the future efforts of its crawl towards higher-quality websites, and away from the myriad low-quality websites and crawler traps that litter the World Wide Web.

Categories and Subject Descriptors D.2.11 [Software]: Software Architecture; H.2 [Information Systems]: Information Storage and Retrieval
General Terms Performance, Design
Keywords...
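The sample-then-score idea in the abstract can be illustrated with a minimal sketch. This is not netSifter's implementation: the abstract does not specify its analytics or heuristics, so the page heuristic (`toy_page_score`), the averaging rule, the sample size, and all function names below are hypothetical stand-ins chosen only to show the overall shape of the approach.

```python
import heapq
import random
from typing import Callable, Dict, List


def toy_page_score(text: str) -> float:
    """Hypothetical page heuristic: fraction of distinct words on the page.

    Highly repetitive pages (a crude proxy for spam or traps) score low.
    """
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0


def score_site(
    pages: List[str],
    page_score: Callable[[str], float] = toy_page_score,
    sample_size: int = 5,
    seed: int = 0,
) -> float:
    """Estimate a site's utility as the mean score of a random page sample."""
    if not pages:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(pages, min(sample_size, len(pages)))
    return sum(page_score(p) for p in sample) / len(sample)


def crawl_order(sites: Dict[str, List[str]]) -> List[str]:
    """Return site names ordered from highest to lowest estimated utility."""
    # heapq is a min-heap, so scores are negated to pop the best site first.
    heap = [(-score_site(pages), name) for name, pages in sites.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    # Made-up sites and page texts, for illustration only.
    sites = {
        "trap.example": ["buy buy buy buy", "click click click click"],
        "good.example": ["alpha beta gamma delta", "one two three four"],
    }
    print(crawl_order(sites))
```

In this sketch only a handful of pages per site are scored, and the resulting site-level estimate decides which sites the crawler would visit first. A real system would presumably use richer page-level signals and a more careful sampling strategy.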
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2006
Where WWW
Authors Iván González, Adam Marcus 0002, Daniel N. Meredith, Linda A. Nguyen