The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude.

We describe the WebTables system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus?

First, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine.
Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang
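The abstract only alludes to the step that separates high-quality relational tables from layout and formatting tables. As a rough illustration of what such feature-based filtering can look like (the features, thresholds, and function name below are assumptions for exposition, not the paper's actual classifier), consider this minimal sketch:

```python
# Illustrative sketch only: a toy heuristic filter for "relational-looking" HTML
# tables, loosely in the spirit of the filtering step mentioned in the abstract.
# The feature set and thresholds are assumptions, not the WebTables classifier.

def looks_relational(table):
    """table: list of rows, each row a list of cell strings."""
    if len(table) < 2:                        # need a header row plus data rows
        return False
    widths = {len(row) for row in table}
    if len(widths) != 1 or widths.pop() < 2:  # require a consistent, multi-column grid
        return False
    header, *body = table
    # Heuristic: header cells are short labels; body cells are mostly non-empty.
    header_ok = all(0 < len(cell.strip()) <= 30 for cell in header)
    fill_rate = sum(bool(cell.strip()) for row in body for cell in row) / (
        len(body) * len(header)
    )
    return header_ok and fill_rate > 0.8

# Example: a small relational table passes, a layout-style fragment does not.
print(looks_relational([["Country", "Capital"], ["France", "Paris"], ["Peru", "Lima"]]))  # True
print(looks_relational([["", "menu"], ["nav"]]))                                          # False
```

A production classifier over billions of extracted tables would of course use learned statistical features rather than fixed thresholds; the sketch is only meant to convey the kind of per-table signal involved.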