Schema-guided wrapper maintenance for web-data extraction

Extracting data from Web pages using wrappers is a fundamental problem that arises in a wide variety of applications of great practical interest. Web-data extraction raises two main issues: wrapper generation and wrapper maintenance. In this paper, we propose a novel schema-guided approach to automatic wrapper maintenance. It is based on the observation that, despite various page changes, many important features of the pages are preserved, such as the syntactic patterns, annotations, and hyperlinks of the extracted data items. Our approach uses these preserved features to locate the desired values in the changed pages, and repairs the wrappers accordingly by inducing semantic blocks from the HTML tree. Our extensive experiments on real Web sites show that the proposed approach can effectively maintain wrappers so that they extract the desired data with high accuracy.

Categories and Subject Descriptors: H.2.5 [Heterogeneous Databases] H.2.8 [Database A...
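
The idea of relocating a value through its preserved features can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified illustration that assumes a data item is described by only two preserved features, a syntactic pattern (here a regular expression) and an annotation (the label text that immediately precedes the value). The names `TextCollector`, `relocate`, and `new_page` are hypothetical, and the sample page is invented for the example.

```python
import re
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not stay on the open-tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TextCollector(HTMLParser):
    """Collect non-empty text nodes together with their tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.nodes = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def relocate(html, pattern, annotation):
    """Find the first text node that fully matches the syntactic pattern
    and directly follows a text node containing the annotation.

    Returns (tag_path, value), or None if no such node exists.
    """
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    regex = re.compile(pattern)
    previous = ""
    for path, text in collector.nodes:
        if regex.fullmatch(text) and annotation.lower() in previous.lower():
            return path, text
        previous = text
    return None


# Hypothetical changed page: the layout differs from whatever the old
# wrapper expected, but the "Price" label and the currency format survive.
new_page = (
    "<div><span>Shipping:</span><b>$3.00</b></div>"
    "<div><span>Price:</span><b>$10.49</b></div>"
)

print(relocate(new_page, r"\$\d+\.\d{2}", "Price"))
```

In this sketch the syntactic pattern alone would also match the shipping cost; the annotation is what disambiguates the two candidates. The returned tag path is the kind of location information a repaired wrapper could be rebuilt from, although the paper's semantic-block induction over the HTML tree is not modeled here.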

Xiaofeng Meng, Dongdong Hu, Chen Li

Automatic Wrapper Maintenance | WIDM 2003 | Wrapper Generation | Wrapper Maintenance

» Maintaining Web Navigation Flows for Wrappers

» Datarover a taxonomy based crawler for automated data extraction from dataintensive websit...

Post Info

Added	05 Jul 2010
Updated	05 Jul 2010
Type	Conference
Year	2003
Where	WIDM
Authors	Xiaofeng Meng, Dongdong Hu, Chen Li
