Sciweavers

WEBDB
2010
Springer

Redundancy-Driven Web Data Extraction and Integration

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the scale of the Web implies substantial redundancy. We present a domain-independent system that exploits this redundancy to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data or product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.
Added 11 Jul 2010
Updated 11 Jul 2010
Type Conference
Year 2010
Where WEBDB
Authors Paolo Papotti, Valter Crescenzi, Paolo Merialdo, Mirko Bronzi, Lorenzo Blanco