Web crawling is often divided into two broad areas: full-web crawling and focused crawling. We present netSifter, a crawler system that integrates features from both areas to provide an effective mechanism for web-scale crawling. netSifter applies a combination of page-level analytics and heuristics to a sample of web pages from a given website. These algorithms score individual web pages to determine the general utility of the overall website. In doing so, netSifter can form an in-depth opinion of a website (and of all its web pages) with relatively little work. netSifter can then bias its future crawl effort towards higher-quality websites and away from the myriad low-quality websites and crawler traps that litter the World Wide Web.

Categories and Subject Descriptors: D.2.11 [Software]: Software Architecture; H.2 [Information Systems]: Information Storage and Retrieval

General Terms: Performance, Design

Keywords...
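The sample-score-prioritize idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from netSifter itself: the function names (`score_page`, `score_site`, `prioritize`), the distinct-word heuristic, the use of a plain mean to aggregate page scores, and the sample size.

```python
import heapq
import random
from statistics import mean


def score_page(text):
    """Hypothetical page-level heuristic: fraction of distinct words on the page.

    Pages that repeat the same few tokens (a common trait of spam and
    crawler traps) score low. This is a stand-in, not netSifter's scoring.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def score_site(pages, sample_size=3, rng=None):
    """Estimate a site's utility as the mean score of a random sample of its pages."""
    if not pages:
        return 0.0
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(pages, min(sample_size, len(pages)))
    return mean(score_page(p) for p in sample)


def prioritize(sites, sample_size=3):
    """Return site names ordered from highest to lowest estimated utility."""
    # heapq is a min-heap, so negated scores make the best site pop first.
    heap = [(-score_site(pages, sample_size), name) for name, pages in sites.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    sites = {
        "trap.example": ["buy buy buy buy", "click click click click"],
        "docs.example": ["crawler design notes", "focused crawling overview"],
    }
    print(prioritize(sites))
```

In this sketch only a few pages per site are inspected, so the cost of judging a site stays small relative to crawling it in full; a real system would presumably combine several heuristics and revise the site score as more pages are fetched.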
Iván González, Adam Marcus 0002, Daniel N. M