Using Common Schemas for Information Extraction from Heterogeneous Web Catalogs

The Web has become the world’s largest information source. Unfortunately, the main success factor of the Web, the inherent distribution and autonomy of its participants, is also its main problem. To make this information machine-processable, common structures and semantics have to be identified. This is exactly the goal of information extraction (IE): to transform text into a structured format. In this paper, we present a novel approach to information extraction developed as part of the XI³ project. Central to our approach is the assumption that we can better understand a text fragment if we consider how it fits into higher-level concepts, exploiting text fragments from different parts of a source. Going beyond previous approaches, we offer a more expressive extraction schema and an advanced method for dealing with ambiguous text. With our approach we solve one of the main challenges of information extraction, providing a way ...

Richard Vlach, Wassili Kazakos
