
WIDM
2003
ACM

Schema-guided wrapper maintenance for web-data extraction

Extracting data from Web pages using wrappers is a fundamental problem that arises in a wide variety of applications of great practical interest. There are two main issues in Web-data extraction: wrapper generation and wrapper maintenance. In this paper, we propose a novel schema-guided approach to the problem of automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved, such as the syntactic patterns, annotations, and hyperlinks of the extracted data items. Our approach uses these preserved features to identify the locations of the desired values in the changed pages and repairs the wrappers accordingly by inducing semantic blocks from the HTML tree. Our intensive experiments on real Web sites show that the proposed approach can effectively maintain wrappers to extract the desired data with high accuracy.
Categories and Subject Descriptors: H.2.5 [Heterogeneous Databases]; H.2.8 [Database A...
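The abstract's core idea, relocating values in a changed page through features that survive the change, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the feature dictionary, the scoring scheme, the sample HTML, and the names `TextNodeCollector` and `relocate` are all assumptions made for illustration, and the step of inducing semantic blocks from the HTML tree is not modeled.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class TextNodeCollector(HTMLParser):
    """Collect (tag_path, text, inside_link) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text, "a" in self.stack))


def relocate(html, features):
    """Hypothetical helper: find each schema field in a changed page.

    The syntactic pattern is a hard requirement; a matching annotation in
    the preceding text node and a matching hyperlink flag each add one point.
    """
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    nodes = parser.nodes
    located = {}
    for field, feat in features.items():
        best, best_score = None, 0
        for i, (path, text, in_link) in enumerate(nodes):
            if not re.fullmatch(feat["pattern"], text):
                continue
            score = 1
            prev_text = nodes[i - 1][1] if i > 0 else ""
            if feat.get("annotation") and feat["annotation"] in prev_text:
                score += 1
            if feat.get("is_link", False) == in_link:
                score += 1
            if score > best_score:
                best, best_score = (path, text), score
        located[field] = best
    return located


# Assumed example: features remembered from the old page, applied to a new layout.
features = {
    "title": {"pattern": r"[A-Z][\w ]+", "annotation": "Title:", "is_link": True},
    "price": {"pattern": r"\$\d+\.\d{2}", "annotation": "Price:", "is_link": False},
}
new_html = """
<div><ul>
<li><b>Title:</b> <a href="/b/1">Data on the Web</a></li>
<li><b>Price:</b> <span>$39.95</span></li>
</ul></div>
"""
located = relocate(new_html, features)
```

Each entry of `located` pairs a tag path in the changed page with the value found there, which is the kind of information a wrapper repair step would need in order to rewrite its extraction rules.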
Xiaofeng Meng, Dongdong Hu, Chen Li
Added 05 Jul 2010
Updated 05 Jul 2010
Type Conference
Year 2003
Where WIDM