ViPER: augmenting automatic information extraction with visual perceptions

16 years 3 days ago


In this paper we address the problem of unsupervised Web data extraction. We show that unsupervised Web data extraction becomes feasible for pages that are made up of repetitive patterns, as is the case, for example, for search engine result pages. The extraction rules are generated automatically, without any training or human interaction, by operating on the DOM tree or on the flat tag token sequence of a single page. Our contribution to automatic data extraction is twofold. First, we identify and rank potential repetitive patterns with respect to the user's visual perception of the Web page, since the location and size of matching elements within a Web page constitute important criteria for defining relevance. Second, matching subsequences of the pattern with the highest weight are aligned with global multiple sequence alignment techniques. Experimental results show that our system is able to achieve high accuracy in dis...
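To make the idea of repetitive patterns in a flat tag token sequence concrete, the following is a minimal sketch in Python. It is not the ViPER algorithm: it leaves out the visual ranking by location and size and the multiple sequence alignment step, and it scores candidates with a simple assumed heuristic (pattern length times number of non-overlapping occurrences). The names `tag_tokens`, `count_non_overlapping`, and `best_repeated_pattern` are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Collects opening and closing tags as a flat token sequence."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")


def tag_tokens(html):
    """Return the flat tag token sequence of an HTML string (text is ignored)."""
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def count_non_overlapping(tokens, pattern):
    """Count left-to-right, non-overlapping occurrences of pattern in tokens."""
    n, count, i = len(pattern), 0, 0
    while i + n <= len(tokens):
        if tuple(tokens[i:i + n]) == pattern:
            count += 1
            i += n  # skip past this match so occurrences do not overlap
        else:
            i += 1
    return count


def best_repeated_pattern(tokens, min_len=2, max_len=8):
    """Return (pattern, count) with the highest len(pattern) * count.

    Assumed heuristic for illustration; only patterns that occur at least
    twice are considered. Returns None if nothing repeats.
    """
    best, best_score = None, 0
    for n in range(min_len, max_len + 1):
        # dict.fromkeys keeps first-seen order, so ties resolve deterministically
        candidates = dict.fromkeys(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        for pattern in candidates:
            count = count_non_overlapping(tokens, pattern)
            score = n * count
            if count >= 2 and score > best_score:
                best, best_score = (pattern, count), score
    return best


page = "<ul><li><a>x</a></li><li><a>y</a></li><li><a>z</a></li></ul>"
tokens = tag_tokens(page)
result = best_repeated_pattern(tokens)
```

For this toy page, `result` is the pattern `<li> <a> </a> </li>` with three occurrences, which corresponds to one list item per record. A real result page would also need the ranking and alignment steps described in the abstract to handle records whose tag structure varies.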

Kai Simon, Georg Lausen
