Combining content extraction heuristics: the CombinE system



The main text content of an HTML document on the WWW is typically surrounded by additional content, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task of identifying and extracting the main content. Ongoing research has produced several CE heuristics of varying quality. So far, however, only the Crunch framework combines several heuristics to improve its overall CE performance, and many new algorithms have been formulated since Crunch. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems which yield better and more reliable extracts of the main content of a web document.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information Search and Retrieval, Systems and Software

Keywords
Content Extraction, filter ensembles, evaluation
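To make the idea of combining CE heuristics concrete, the following is a minimal sketch of one possible combination scheme, a simple vote over text blocks. It is not the CombinE system's actual method or API: the three toy heuristics, the block representation (a list of strings) and every function name here are assumptions chosen purely for illustration.

```python
from typing import Callable, List, Set

# Assumption: a document is already split into text blocks, and each
# heuristic returns the indices of the blocks it considers main content.
Blocks = List[str]
Heuristic = Callable[[Blocks], Set[int]]


def long_block_heuristic(blocks: Blocks, min_words: int = 8) -> Set[int]:
    """Toy heuristic: keep blocks with at least `min_words` words."""
    return {i for i, b in enumerate(blocks) if len(b.split()) >= min_words}


def punctuation_heuristic(blocks: Blocks) -> Set[int]:
    """Toy heuristic: keep blocks containing sentence punctuation."""
    return {i for i, b in enumerate(blocks) if any(c in b for c in ".!?")}


def no_separator_heuristic(blocks: Blocks) -> Set[int]:
    """Toy heuristic: drop blocks that look like menus ('|' separators)."""
    return {i for i, b in enumerate(blocks) if "|" not in b}


def combine_by_vote(
    blocks: Blocks, heuristics: List[Heuristic], min_votes: int
) -> Blocks:
    """Keep every block selected by at least `min_votes` heuristics."""
    votes = [0] * len(blocks)
    for heuristic in heuristics:
        for index in heuristic(blocks):
            votes[index] += 1
    return [b for b, v in zip(blocks, votes) if v >= min_votes]
```

With `min_votes` equal to the number of heuristics, the combination acts as an intersection (strict, favouring precision); with `min_votes = 1` it acts as a union (lenient, favouring recall). Evaluating such settings against a gold standard is the kind of comparison a combination test bed would support.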

Thomas Gottron
