A Comparison of Techniques for Sampling Web Pages

As the World Wide Web grows rapidly, it is becoming increasingly challenging to gather representative information about it. Instead of crawling the web exhaustively, one has to resort to other techniques, such as random sampling, to determine the properties of the web. Unfortunately, no approach has been shown to sample web pages in an unbiased way. Three promising web sampling algorithms are based on random walks [6, 2, 9]. Each has been evaluated individually, but on different data sets, so a direct comparison is not possible. In this paper we compare these algorithms by running them on the web with the same computation power and for the same amount of time. We then propose improvements based on experimental results.

Keywords: URL sampling, Random walks, PageRank, Information gathering from the web.
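
To make the idea of random-walk sampling concrete, here is a minimal Python sketch. It is not one of the three algorithms cited in the abstract. The toy link graph, the function name `random_walk_sample`, and the restart probability of 0.15 are all assumptions chosen for illustration. The sketch only shows why a naive walk tends to be biased: pages with many in-links are visited far more often than pages reachable only through random restarts.

```python
import random
from collections import Counter

# Hypothetical toy link graph (page -> outgoing links); not data from the paper.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],  # no page links to "d", so it is reached only via restarts
}

def random_walk_sample(graph, steps, restart_prob=0.15, seed=0):
    """Walk the graph for `steps` steps and count visits per page.

    With probability `restart_prob`, or when a page has no out-links,
    the walk jumps to a uniformly random page (a PageRank-style restart).
    Otherwise it follows a uniformly random out-link.
    """
    rng = random.Random(seed)
    pages = sorted(graph)
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        links = graph[current]
        if not links or rng.random() < restart_prob:
            current = rng.choice(pages)
        else:
            current = rng.choice(links)
    return visits

STEPS = 10_000
visits = random_walk_sample(TOY_WEB, steps=STEPS)
for page in sorted(visits):
    # Fraction of steps spent on each page; far from uniform in this toy graph.
    print(page, visits[page] / STEPS)
```

In this toy graph the visit frequencies are far from uniform, which is the kind of bias a web sampling algorithm has to correct for before its samples can be treated as representative.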

Eda Baykan, Monika Rauch Henzinger, Stefan F. Keller, Sebastian De Castelberg, Markus Kinzler

Random Walks | STACS 2009 | Theoretical Computer Science | Web Sampling Algorithms | World Wide Web |

Related Content

» Sampling the National Deep Web

» HostIP clustering technique for deep web characterization

» HTML Pattern Generator: Automatic Data Extraction from Web Pages

» An Evaluation and Comparison of Current Peer-to-Peer Full-Text Keyword Search Techniques

» Experimental Results on the Alignment of Multilingual Web Sites

» When the Web meets the cell: using personalized PageRank for analyzing protein interaction ...

» A Qualitative Oriented Study About IT Procurement Processes Comparison of 4 European Count...

» An Efficient Partition-Based Parallel PageRank Algorithm

» Prophiler: a fast filter for the large-scale detection of malicious web pages

Post Info

Added	20 May 2010
Updated	20 May 2010
Type	Conference
Year	2009
Where	STACS
Authors	Eda Baykan, Monika Rauch Henzinger, Stefan F. Keller, Sebastian De Castelberg, Markus Kinzler

Sciweavers
