Sciweavers

SOFSEM
2007
Springer

Creating Permanent Test Collections of Web Pages for Information Extraction Research

In the research area of automatic web information extraction, there is a need for permanent, annotated web page collections that enable objective performance evaluation of different algorithms. Currently, researchers suffer from the absence of such representative and contemporary test collections, especially for web tables. At the same time, creating your own sharable web page collections is not trivial because of the dynamic and diverse nature of the modern web technologies used to create often short-lived online content. In this paper, we cover the problem of creating static representations of web pages in order to build sharable ground truth test sets. We explain the principal difficulties of the problem, discuss possible approaches and introduce our solution: WebPageDump, a Firefox extension capable of saving web pages exactly as they are rendered online. Finally, we benchmark our system with current alternatives using an innovative automatic method based on image sna...
Bernhard Pollak, Wolfgang Gatterbauer
Added 09 Jun 2010
Updated 09 Jun 2010
Type Conference
Year 2007
Where SOFSEM
Authors Bernhard Pollak, Wolfgang Gatterbauer