Extracting and Querying a Comprehensive Web Database

15 years 8 months ago


Recent research in domain-independent information extraction holds the promise of an automatically constructed structured database derived from the Web. A query system based on this database would offer the same breadth as a Web search engine, but with much more sophisticated query tools than are common today. Unfortunately, these domain-independent Web extractors are usually not model-independent; e.g., an extractor that finds only binary relations in text will be blind to relational data found in tables. Because a topic area often has a data model that is a natural fit (e.g., population statistics are usually in tables, while biographical facts about Einstein are embedded in text), even a high-quality domain-independent extractor will miss a substantial amount of data. Our Omnivore system attempts to build a comprehensive Web database by running multiple domain-independent extractors in parallel over a Web crawl, then combining their outputs into a single large entity-relationship ...
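The idea of running several model-specific extractors side by side and merging their outputs can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy extractors (`text_extractor`, `table_extractor`), the page format, the `build_database` helper, and the sample data are hypothetical and are not taken from the system the abstract describes.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy extractors; both emit (entity, relation, value) triples.

def text_extractor(page):
    """Find 'X was born in Y' patterns in free text."""
    pattern = re.compile(r"([A-Z][a-z]+) was born in ([A-Z][a-z]+)")
    return [(m.group(1), "born_in", m.group(2))
            for m in pattern.finditer(page.get("text", ""))]

def table_extractor(page):
    """Treat the first column of each row as the entity and every
    other column as a relation named by the table header."""
    triples = []
    for table in page.get("tables", []):
        header, *rows = table
        for row in rows:
            for column, value in zip(header[1:], row[1:]):
                triples.append((row[0], column, value))
    return triples

EXTRACTORS = [text_extractor, table_extractor]

def build_database(pages):
    """Run every extractor over every page concurrently, then merge
    all triples into one entity -> relation -> set-of-values store."""
    jobs = [(extractor, page) for page in pages for extractor in EXTRACTORS]
    db = defaultdict(lambda: defaultdict(set))
    with ThreadPoolExecutor() as pool:
        for triples in pool.map(lambda job: job[0](job[1]), jobs):
            for entity, relation, value in triples:
                db[entity][relation].add(value)
    return db

# Made-up sample crawl: one text page and one table page.
pages = [
    {"text": "Einstein was born in Ulm."},
    {"tables": [[["country", "population"], ["Freedonia", "1000"]]]},
]
db = build_database(pages)
```

In this sketch, a text-only extractor alone would never see the table row, and a table-only extractor would never see the biographical sentence; the merged store contains both. A real combiner would also need to reconcile entities that different extractors name differently, which this sketch does not attempt.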

Michael J. Cafarella
