Compressed collections for simulated crawling

Collections are a fundamental tool for reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (a.k.a. within-document frequencies) of a web snapshot in a highly compressed yet quickly accessible form. Our main application is reproducing the behaviour of focused crawlers: by coupling our collection with the corresponding web graph, compressed with WebGraph [3], we make it possible to apply text-based machine learning tools to the collection while keeping the data set footprint small. We describe a collection based on a crawl of 100 million pages of the .uk domain, publicly available together with an open-source Java implementation of our techniques.
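
The abstract does not say which codes the collection uses, so the following is only a generic illustration of why term counts compress well, not the authors' method. It assumes Elias gamma coding, a standard choice for small positive integers, and the function names gamma_encode and gamma_decode are hypothetical. Within-document frequencies are typically dominated by small values, and gamma codes spend a single bit on a count of 1.

```python
def gamma_encode(values):
    """Encode positive integers as a string of '0'/'1' characters (Elias gamma)."""
    bits = []
    for v in values:
        if v < 1:
            raise ValueError("Elias gamma codes only represent integers >= 1")
        binary = bin(v)[2:]  # binary digits, most significant bit first
        # Unary length prefix (one zero per extra digit), then the digits themselves.
        bits.append("0" * (len(binary) - 1) + binary)
    return "".join(bits)


def gamma_decode(bitstring, count):
    """Decode `count` Elias gamma codes from the start of `bitstring`."""
    values = []
    pos = 0
    for _ in range(count):
        zeros = 0
        while bitstring[pos] == "0":  # count the unary prefix
            zeros += 1
            pos += 1
        # The value occupies zeros + 1 bits, starting at the leading 1.
        values.append(int(bitstring[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


# Hypothetical term counts for one document (illustrative data only).
term_counts = [1, 1, 2, 1, 5, 1, 3]
encoded = gamma_encode(term_counts)
assert gamma_decode(encoded, len(term_counts)) == term_counts
print(len(encoded), "bits for", len(term_counts), "counts")
```

In this sketch the seven counts take 15 bits in total, compared with 224 bits as fixed 32-bit integers. A real collection would pack the bits into bytes and add an index for fast access to individual documents.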

Alessio Orlandi, Sebastiano Vigna

Corresponding Web Graph | Fundamental Tool | Information Retrieval Techniques | Information Technology | SIGIR 2008 |

Post Info

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Journal
Year	2008
Where	SIGIR
Authors	Alessio Orlandi, Sebastiano Vigna
