Sciweavers

SIGMOD
2008
ACM

Web-scale extraction of structured data

A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (two of which come from Google Research). The TextRunner system focuses on raw natural language text, the WebTables system focuses on HTML-embedded tables, and the deep-web surfacing system focuses on "hidden" databases. The domain, expressiveness, and accuracy of extracted data can depend strongly on its source extractor; we describe differences in the characteristics of data produced by the three extractors. Finally, we discuss a series of unique data applications (some of which have already been prototyped) that are enabled by aggregating extracted Web information.
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2008
Where SIGMOD
Authors Michael J. Cafarella, Jayant Madhavan, Alon Y. Halevy