Sciweavers

23 search results - page 2 / 5
» On near-uniform URL sampling
Sort
View
WWW
2006
ACM
14 years 1 months ago
Do not crawl in the DUST: different URLs with similar text
We consider the problem of dust: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and...
Uri Schonfeld, Ziv Bar-Yossef, Idit Keidar
WWW
2010
ACM
14 years 2 months ago
A pattern tree-based approach to learning URL normalization rules
Duplicate URLs have brought serious troubles to the whole pipeline of a search engine, from crawling, indexing, to result serving. URL normalization is to transform duplicate URLs...
Tao Lei, Rui Cai, Jiang-Ming Yang, Yan Ke, Xiaodon...
WWW
2008
ACM
14 years 8 months ago
Behavioral classification on the click graph
A bipartite query-URL graph, where an edge indicates that a document was clicked for a query, is a useful construct for finding groups of related queries and URLs. Here we use thi...
Martin Szummer, Nick Craswell
STACS
2009
Springer
14 years 2 months ago
A Comparison of Techniques for Sampling Web Pages
As the World Wide Web is growing rapidly, it is getting increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively one has to...
Eda Baykan, Monika Rauch Henzinger, Stefan F. Kell...
WEBDB
2007
Springer
159views Database» more  WEBDB 2007»
14 years 1 months ago
A clustering-based sampling approach for refreshing search engine's database
Due to resource constraints, search engines usually have difficulties keeping the local database completely synchronized with the Web. To detect as many changes as possible, the ...
Qingzhao Tan, Ziming Zhuang, Prasenjit Mitra, C. L...