Exploiting content redundancy for web information extraction


We propose a novel extraction approach that exploits content redundancy on the web to extract structured data from template-based web sites. We start by populating a seed database with records extracted from a few initial sites. We then identify values within the pages of each new site that match attribute values contained in the seed set of records. To match attribute values with diverse representations across sites, we define a new similarity metric that leverages the templatized structure of attribute content. Specifically, our metric discovers the matching pattern between attribute values from two sites, and uses this to ignore extraneous portions of attribute values when computing similarity scores. Further, to filter out noisy attribute value matches, we exploit the fact that attribute values occur at fixed positions within template-based sites. We develop an efficient Apriori-style algorithm to systematically enumerate attribute position configurations with sufficient matching ...
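The idea of ignoring extraneous, templated portions of an attribute value before scoring a match can be illustrated with a small sketch. This is not the metric from the paper: it swaps in a simple frequency heuristic (tokens that recur in most of one site's values are treated as template text) and plain Jaccard similarity. The function names, the `min_fraction` threshold, and the sample values are all hypothetical.

```python
from collections import Counter


def tokens(value):
    """Lowercase an attribute value and split it on whitespace."""
    return value.lower().split()


def template_tokens(site_values, min_fraction=0.8):
    """Return tokens present in at least min_fraction of a site's values.

    Assumption for this sketch: tokens that recur across most values of
    the same attribute on one site come from the page template.
    """
    counts = Counter()
    for value in site_values:
        counts.update(set(tokens(value)))  # count each token once per value
    threshold = min_fraction * len(site_values)
    return {tok for tok, n in counts.items() if n >= threshold}


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def template_aware_similarity(seed_value, site_value, extraneous):
    """Score a seed value against a site value, skipping template tokens."""
    core = [t for t in tokens(site_value) if t not in extraneous]
    return jaccard(tokens(seed_value), core)


# Hypothetical values of one attribute taken from a single new site.
site_values = [
    "Reviews of Blue Bottle Cafe | CityGuide",
    "Reviews of Golden Dragon | CityGuide",
    "Reviews of Luigi's Pizzeria | CityGuide",
]
extraneous = template_tokens(site_values)

# Plain Jaccard is pulled down by the template text around the name.
plain = jaccard(tokens("Blue Bottle Cafe"), tokens(site_values[0]))
# Template-aware score compares only the non-template portion.
aware = template_aware_similarity("Blue Bottle Cafe", site_values[0], extraneous)
print(plain, aware)
```

In this toy example the seed value "Blue Bottle Cafe" scores 3/7 under plain Jaccard but 1.0 once the recurring tokens are dropped, which is the effect the abstract describes in general terms.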

Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, Ashwin Tengli


Attribute Values | Internet Technology | Match Attribute | Template-based Web Sites | WWW 2010 |


» Redundancy-Driven Web Data Extraction and Integration

» Extracting Instances of Relations from Web Documents Using Redundancy

» Elimination of Redundant Information for Web Data Mining

» Discovering informative content blocks from Web documents

» Inferring user intent in web search by exploiting social annotations

» A Redundancy-Based Method for Relation Instantiation from the Web

» Enriching the Contents of Enterprises Wiki Systems with Web Information

» Towards a Search System for the Web Exploiting Spatial Data of a Web Document

Post Info

Added	06 Dec 2010
Updated	06 Dec 2010
Type	Conference
Year	2010
Where	WWW
Authors	Pankaj Gulhane, Rajeev Rastogi, Srinivasan H. Sengamedu, Ashwin Tengli

