Intelligent crawling on the World Wide Web with arbitrary predicates

16 years 8 months ago

Download www10.org

The enormous growth of the world wide web in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed in recent years; among them, a key technique is focused crawling, which is able to crawl particular topical portions of the world wide web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling, which actually learns characteristics of the linkage structure of the world wide web while performing the crawling. Specifically, the intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique, which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper are applicable for crawl...
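The idea of estimating, during the crawl itself, how likely a candidate is to be useful can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses only one of the signals the abstract names (candidate URL structure), and the smoothed ratio, the class `TokenStats`, and the functions `url_tokens` and `crawl` are all hypothetical names chosen for illustration. The caller supplies `fetch_links` (returns the outlinks of a URL) and `predicate` (decides whether a crawled page is useful); both are assumptions, and here the predicate receives the URL rather than the page content to keep the example self-contained.

```python
import heapq
import re
from collections import Counter


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


class TokenStats:
    """Running counts of URL tokens seen on crawled pages (illustrative only)."""

    def __init__(self):
        self.seen = Counter()      # token -> crawled pages whose URL contains it
        self.relevant = Counter()  # token -> of those, pages satisfying the predicate

    def update(self, url, satisfied):
        for tok in set(url_tokens(url)):
            self.seen[tok] += 1
            if satisfied:
                self.relevant[tok] += 1

    def score(self, url):
        """Mean smoothed estimate of P(useful | token) over the URL's tokens."""
        toks = set(url_tokens(url))
        if not toks:
            return 0.5
        # Add-one smoothing: a token never seen before scores a neutral 0.5.
        return sum((self.relevant[t] + 1) / (self.seen[t] + 2) for t in toks) / len(toks)


def crawl(seeds, fetch_links, predicate, limit):
    """Best-first crawl that learns token statistics as it goes."""
    stats = TokenStats()
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-0.5, url) for url in seeds]
    heapq.heapify(frontier)
    visited, hits = set(), []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        satisfied = predicate(url)
        stats.update(url, satisfied)
        if satisfied:
            hits.append(url)
        for link in fetch_links(url):
            if link not in visited:
                # Scored once, at discovery time, with the statistics so far.
                heapq.heappush(frontier, (-stats.score(link), link))
    return hits
```

Candidates are scored only when they are discovered; a fuller version would re-score the frontier as the statistics change and would combine this signal with the content of inlinking pages and the behavior of siblings, as the abstract describes.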

Charu C. Aggarwal, Fatima Al-Garawi, Philip S. Yu
