As the World Wide Web grows rapidly, it is becoming increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively, one has to resort to other techniques, such as random sampling, to determine the properties of the web. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks [6, 2, 9]. Each has been evaluated individually, but on different data sets, so a direct comparison has not been possible. In this paper we compare these algorithms by running them on the web with the same computational power and for the same amount of time. We then propose improvements based on the experimental results.

Keywords: URL sampling, Random walks, PageRank, Information gathering from the web.
Eda Baykan, Monika Rauch Henzinger, Stefan F. Kell