
IIWAS
2008

Combining content extraction heuristics: the CombinE system

The main text content of an HTML document on the WWW is typically surrounded by additional content, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task of identifying and extracting the main content. Ongoing research has produced several CE heuristics of varying quality. So far, however, only the Crunch framework combines several heuristics to improve its overall CE performance, and many new algorithms have been formulated since Crunch. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems that yield better and more reliable extracts of the main content of a web document.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: Information Search and Retrieval, Systems and Software

Keywords: Content Extraction, filter ensembles, evaluation
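To make the idea of combining CE heuristics concrete, the following is a minimal sketch, not the CombinE implementation. It assumes a page has already been split into text blocks, that each heuristic returns the indices of the blocks it would keep, and that heuristics are combined by simple voting. The three heuristics, the `[link]` marker, all thresholds and the block-level F1 measure are hypothetical choices made for illustration only.

```python
from collections import Counter
from typing import Callable, List, Set

# Assumption: a page is pre-segmented into text blocks; a heuristic maps
# the block list to the set of block indices it considers main content.
Heuristic = Callable[[List[str]], Set[int]]


def long_block_heuristic(blocks: List[str], min_words: int = 10) -> Set[int]:
    """Hypothetical heuristic: keep blocks with at least min_words words."""
    return {i for i, b in enumerate(blocks) if len(b.split()) >= min_words}


def low_link_heuristic(blocks: List[str], marker: str = "[link]",
                       max_ratio: float = 0.3) -> Set[int]:
    """Hypothetical heuristic: keep blocks with a low share of link tokens."""
    kept = set()
    for i, b in enumerate(blocks):
        words = b.split()
        if words and sum(w == marker for w in words) / len(words) <= max_ratio:
            kept.add(i)
    return kept


def punctuation_heuristic(blocks: List[str]) -> Set[int]:
    """Hypothetical heuristic: keep blocks containing sentence punctuation."""
    return {i for i, b in enumerate(blocks) if any(c in b for c in ".!?")}


def vote(heuristics: List[Heuristic], blocks: List[str],
         min_votes: int) -> Set[int]:
    """Keep a block if at least min_votes heuristics selected it."""
    votes = Counter()
    for h in heuristics:
        votes.update(h(blocks))
    return {i for i, n in votes.items() if n >= min_votes}


def f1_score(extracted: Set[int], gold: Set[int]) -> float:
    """Block-level F1 of an extract against a hand-labelled gold standard."""
    tp = len(extracted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(extracted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy page: blocks 1 and 3 are the main content, 0 and 2 are clutter.
blocks = [
    "[link] Home [link] News [link] Sports",
    "Content extraction identifies the main text of a web document "
    "and removes the surrounding clutter.",
    "Buy now! [link] [link]",
    "Several heuristics exist, each with different strengths and "
    "weaknesses on different page layouts.",
]
gold = {1, 3}
heuristics = [long_block_heuristic, low_link_heuristic, punctuation_heuristic]

single = punctuation_heuristic(blocks)           # also keeps the ad block
combined = vote(heuristics, blocks, min_votes=2)  # ad block is outvoted
print(f1_score(single, gold), f1_score(combined, gold))
```

On this toy page the punctuation heuristic alone also keeps the advertisement block, while the two-of-three vote discards it. Evaluating each candidate combination against a gold standard in this way is the kind of comparison the abstract describes, under the assumptions stated above.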
Thomas Gottron
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where IIWAS
Authors Thomas Gottron