Sciweavers

WEBDB
2007
Springer

A clustering-based sampling approach for refreshing search engine's database

Due to resource constraints, search engines usually have difficulty keeping their local database completely synchronized with the Web. To detect as many changes as possible, the crawler used by a search engine should be able to predict the change behavior of webpages so that it can use its limited resources to download the webpages that are most likely to change. Towards this goal, we propose a sampling approach at the level of a cluster. We first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster; clusters containing webpages with higher change frequency are revisited more often by our crawler. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results show that by applying our clustering algorithm, pages with similar...
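The sample-then-rank step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the clustering has already been done and is supplied as a URL-to-cluster mapping, and it estimates change frequency simply as the fraction of sampled pages that changed. The function name `rank_clusters_by_sampled_change` and the parameters `cluster_of`, `has_changed`, `sample_size`, and `seed` are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def rank_clusters_by_sampled_change(cluster_of, has_changed, sample_size, seed=0):
    """Return (cluster_id, estimated_change_ratio) pairs, most-changing first.

    cluster_of  -- dict mapping URL -> cluster id (clustering assumed done elsewhere)
    has_changed -- callable taking a URL and returning True if the page changed
                   since the last crawl (hypothetical; e.g. a checksum comparison)
    sample_size -- number of pages to check per cluster
    """
    rng = random.Random(seed)

    # Group URLs by their cluster id.
    members = defaultdict(list)
    for url, cluster_id in cluster_of.items():
        members[cluster_id].append(url)

    estimates = {}
    for cluster_id, urls in members.items():
        # Sample without replacement; clusters smaller than sample_size
        # are checked in full.
        sample = rng.sample(urls, min(sample_size, len(urls)))
        changed = sum(1 for url in sample if has_changed(url))
        estimates[cluster_id] = changed / len(sample)

    # Clusters with a higher estimated change ratio come first, so a
    # crawler could revisit them more often.
    return sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
```

A crawler could walk the returned list from the top and spend its download budget on the highest-ranked clusters first. How the budget is split across clusters is a separate policy choice that this sketch leaves open.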
Added 09 Jun 2010
Updated 09 Jun 2010
Type Conference
Year 2007
Where WEBDB
Authors Qingzhao Tan, Ziming Zhuang, Prasenjit Mitra, C. Lee Giles