Open Information Extraction (OIE) is a recently introduced type of information extraction that extracts small, individual pieces of data from input text without any domain-specific guidance such as special training data or extraction rules. For example, an OIE system might discover the triple (Frenzy, year, 1972) from a set of documents about movies. Because OIE is domain-independent, it promises to help users whose corpus contains structured data with an unknown structure, such as when browsing a novel domain or formulating a query. We can describe the structure to the user by displaying a relational schema that fits the extracted data. Unfortunately, the extractions do not carry full schema information: we have the extracted values, but not the correct relations, their rows, or their columns. In response, we propose TGen, an algorithm for schema discovery that automatically derives a high-quality relational schema for the extracted data. Different applications have different ...
Michael J. Cafarella, Dan Suciu, Oren Etzioni
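
To make the gap between extracted triples and a relational schema concrete, the following is a minimal sketch and not the TGen algorithm itself. It assumes extractions arrive as (object, attribute, value) triples and pivots them into a single wide table. The function name `naive_table` and every triple other than (Frenzy, year, 1972) are illustrative assumptions rather than material from the paper.

```python
from collections import defaultdict

# Hypothetical OIE output as (object, attribute, value) triples.
# Only the first triple appears in the abstract; the others are
# added here purely for illustration.
triples = [
    ("Frenzy", "year", "1972"),
    ("Frenzy", "director", "Alfred Hitchcock"),
    ("Psycho", "year", "1960"),
]


def naive_table(triples):
    """Pivot triples into one wide table.

    Returns the column names in first-seen order and a dict that maps
    each object to its {attribute: value} row. This is a naive baseline,
    not a schema discovery algorithm: it never decides which attributes
    belong together in which relation.
    """
    rows = defaultdict(dict)
    columns = []
    for obj, attr, value in triples:
        rows[obj][attr] = value
        if attr not in columns:
            columns.append(attr)
    return columns, dict(rows)


columns, rows = naive_table(triples)

# Print the table, marking cells that no extraction filled.
for obj, row in rows.items():
    print(obj, [row.get(col, "NULL") for col in columns])
```

The single-table pivot leaves an empty cell wherever an object lacks an attribute (here, Psycho has no director), and it offers no principled way to split attributes across several relations. Those are the choices a schema discovery algorithm has to make.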