Due to resource constraints, search engines usually have difficulty keeping the local database completely synchronized with the Web. To detect as many changes as possible, the crawler used by a search engine should be able to predict the change behavior of webpages so that it can use its limited resources to download those webpages that are most likely to change. Towards this goal, we propose a sampling approach at the level of clusters. We first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster, and clusters whose webpages have a higher estimated change frequency are revisited more often by our crawler. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results show that by applying our clustering algorithm, pages with similar...
Qingzhao Tan, Ziming Zhuang, Prasenjit Mitra, C. Lee Giles
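For illustration only, the following Python sketch captures the high-level policy the abstract describes, not the authors' actual implementation. It assumes `clusters` maps a cluster id to the URLs grouped by change pattern, `has_changed` is a hypothetical predicate that re-fetches a page and compares it with the local copy, and `budget` is the number of pages the crawler may download in one cycle; all names are illustrative.

```python
import random

def estimate_cluster_change_rates(clusters, sample_size, has_changed):
    """Estimate each cluster's change frequency by re-fetching a small
    random sample of its pages and counting how many have changed.
    `has_changed(url)` is a hypothetical predicate (an assumption)."""
    rates = {}
    for cluster_id, urls in clusters.items():
        sample = random.sample(urls, min(sample_size, len(urls)))
        changed = sum(1 for url in sample if has_changed(url))
        rates[cluster_id] = changed / len(sample)
    return rates

def allocate_downloads(rates, budget):
    """Split a fixed download budget across clusters in proportion to
    their estimated change rates, so clusters whose sampled pages change
    more often are revisited more frequently."""
    total = sum(rates.values()) or 1.0
    return {cid: int(budget * rate / total) for cid, rate in rates.items()}
```

A crawler following this policy would rerun the sampling step each crawl cycle, so the per-cluster estimates track shifts in how often the underlying pages change.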