U-REST: an unsupervised record extraction system



In this paper, we describe a system that can extract record structures from web pages with no direct human supervision. Records are commonly occurring HTML-embedded data tuples that describe people, offered courses, products, company profiles, etc. We present a simplified framework for studying the problem of unsupervised record extraction, one that separates the algorithms from the feature engineering. Our system, U-REST, formalizes an approach to the problem of unsupervised record extraction using a simple two-stage machine learning framework. The first stage involves clustering, where structurally similar regions are discovered, and the second stage involves classification, where discovered groupings (clusters of regions) are ranked by their likelihood of being records. In our work, we describe and summarize the results of an extensive survey of features for both stages. We conclude by comparing U-REST to related systems. The results of our empirical evaluation show encouraging ...
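The abstract gives no implementation detail, so the following is only a minimal sketch of the two-stage shape it describes, not the U-REST implementation. Everything in it is an assumption made for illustration: the `Region` representation, the Jaccard similarity over tag sets, the greedy clustering with a fixed threshold, and the hand-written `record_score` heuristic, which stands in for the trained classifier the paper refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical page region: the HTML tags it contains plus its text."""
    tags: tuple
    text: str


def jaccard(a, b):
    """Jaccard similarity of two tag collections; 1.0 when both are empty."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cluster_regions(regions, threshold=0.6):
    """Stage 1 (sketch): greedy one-pass clustering by structural similarity.

    The threshold is an arbitrary illustrative value.
    """
    clusters = []
    for region in regions:
        for cluster in clusters:
            # Use the first member as the cluster's representative.
            if jaccard(region.tags, cluster[0].tags) >= threshold:
                cluster.append(region)
                break
        else:
            clusters.append([region])
    return clusters


def record_score(cluster):
    """Stage 2 (sketch): heuristic stand-in for a learned record classifier.

    Rewards clusters that repeat, carry varied text, and have some structure.
    """
    if len(cluster) < 2:
        return 0.0  # a single region cannot be a repeated record pattern
    distinct_ratio = len({r.text for r in cluster}) / len(cluster)
    avg_tags = sum(len(r.tags) for r in cluster) / len(cluster)
    return len(cluster) * distinct_ratio * min(avg_tags, 5.0)


def rank_clusters(clusters):
    """Order clusters from most to least record-like."""
    return sorted(clusters, key=record_score, reverse=True)


if __name__ == "__main__":
    page = [
        Region(("li", "a", "span"), "Alice, MIT"),
        Region(("li", "a", "span"), "Bob, CMU"),
        Region(("li", "a", "span"), "Carol, UW"),
        Region(("div", "img"), "banner"),
        Region(("p",), "footer"),
        Region(("p",), "footer"),
    ]
    for cluster in rank_clusters(cluster_regions(page)):
        print(record_score(cluster), [r.text for r in cluster])
```

Keeping the similarity function and the scoring function separate from the clustering and ranking loops mirrors the separation of algorithms from feature engineering that the abstract emphasizes: either function could be swapped without touching the pipeline.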

Yuan Kui Shen, David R. Karger


Internet Technology | Keywords Record Extraction | Stage Involves Classification | Unsupervised Record Extraction | WWW 2007 |


» ViPER augmenting automatic information extraction with visual perceptions

» Unsupervised deduplication using crossfield dependencies

» Automatic Discovery of Action Taxonomies from Multiple Views

Post Info

Added	22 Nov 2009
Updated	22 Nov 2009
Type	Conference
Year	2007
Where	WWW
Authors	Yuan Kui Shen, David R. Karger

