Business Specific Online Information Extraction from German Websites



This paper presents a system that uses the domain name of a German business website to locate its information pages (e.g. company profile, contact page, imprint) and then identifies business-specific information. We therefore concentrate on the extraction of characteristic vocabulary such as company names, addresses, contact details, CEOs, etc. Above all, we interpret the HTML structure of documents and analyze some contextual facts to transform the unstructured web pages into structured forms. Our approach is quite robust to variability in the DOM, is upgradeable, and keeps data up to date. The evaluation experiments show high efficiency of information access to the generated data. The developed technique can also be adapted to non-German websites with slight language-specific modifications, and experimental results on real-life websites confirm the feasibility of the approach.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Search process; I.2.7 [Natural Language...
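The two steps named in the abstract, locating information pages from a site's start page and pulling structured fields out of them, can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword list, the postal-code pattern, the function names and the `example.de` URL are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical cue words for German information pages (imprint, contact,
# about us, company); the paper's actual vocabulary is not given here.
INFO_PAGE_KEYWORDS = ("impressum", "kontakt", "über uns", "unternehmen")


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_info_pages(base_url, html):
    """Return absolute URLs of links whose href or anchor text has a cue word."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    pages = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(keyword in haystack for keyword in INFO_PAGE_KEYWORDS):
            pages.append(urljoin(base_url, href))
    return pages


# Assumed pattern: five-digit German postal code followed by a capitalised
# city name, e.g. "80538 München".
POSTCODE_CITY = re.compile(r"\b(\d{5})\s+([A-ZÄÖÜ][\w-]+)")


def extract_postcode_city(text):
    """Return (postal code, city) from free text, or None if nothing matches."""
    match = POSTCODE_CITY.search(text)
    return match.groups() if match else None
```

A real system would also need to fetch the pages, handle character encodings, and use the surrounding DOM context (as the abstract describes) rather than relying on flat patterns alone; the sketch only shows the overall shape of the pipeline.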

Yeong Su Lee, Michaela Geierhos


CICLING 2009 | Natural Language Processing | Slight Language-specific Modifications | Terms Company Search | Unstructured Web Pages |


Added	24 Nov 2009
Updated	24 Nov 2009
Type	Conference
Year	2009
Where	CICLING
Authors	Yeong Su Lee, Michaela Geierhos
