Due to resource constraints, search engines usually have difficulty keeping the local database completely synchronized with the Web. To detect as many changes as possible, the crawler used by a search engine should be able to predict the change behavior of webpages so that it can use its limited resources to download those webpages that are most likely to change. Towards this goal, we propose a sampling approach at the level of clusters. We first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster, and clusters whose webpages have a higher estimated change frequency are revisited more often by our crawler. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results show that by applying our clustering algorithm, pages with similar...
Qingzhao Tan, Ziming Zhuang, Prasenjit Mitra, C. Lee Giles
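For illustration only, the following Python sketch captures the high-level policy the abstract describes, not the authors' actual implementation. It assumes `clusters` maps a cluster id to the URLs grouped by change pattern, `has_changed` is a hypothetical predicate that re-fetches a page and compares it with the local copy, and `budget` is the number of pages the crawler may download in one cycle; all names are illustrative.

```python
import random

def estimate_cluster_change_rates(clusters, sample_size, has_changed):
    """Estimate each cluster's change frequency by re-fetching a small
    random sample of its pages and counting how many have changed.
    `has_changed(url)` is a hypothetical predicate (an assumption)."""
    rates = {}
    for cluster_id, urls in clusters.items():
        sample = random.sample(urls, min(sample_size, len(urls)))
        changed = sum(1 for url in sample if has_changed(url))
        rates[cluster_id] = changed / len(sample)
    return rates

def allocate_downloads(rates, budget):
    """Split a fixed download budget across clusters in proportion to
    their estimated change rates, so clusters whose sampled pages change
    more often are revisited more frequently."""
    total = sum(rates.values()) or 1.0
    return {cid: int(budget * rate / total) for cid, rate in rates.items()}
```

A crawler following this policy would rerun the sampling step each crawl cycle, so the per-cluster estimates track shifts in how often the underlying pages change.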