Boilerplate Detection using Shallow Text Features

In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text is typically unrelated to the main content, may deteriorate search precision, and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable accuracy.

Categories and Subject Descriptors: H.3.3 [Information Systems]: Information Search and Retrieval
General Terms: Algorit...
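
To make the idea of classifying text elements by shallow features concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not name the features, so the two used here (word count and link density), the `TextBlock` type, and the thresholds `min_words` and `max_link_density` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical container for one text element of a Web page."""
    text: str          # visible text of the block
    linked_words: int  # words inside anchor tags (assumed to be precomputed)


def shallow_features(block: TextBlock) -> dict:
    """Compute two cheap, illustrative features for a block."""
    num_words = len(block.text.split())
    # Share of words that are link text; 0.0 for an empty block.
    link_density = block.linked_words / num_words if num_words else 0.0
    return {"num_words": num_words, "link_density": link_density}


def is_boilerplate(block: TextBlock,
                   min_words: int = 10,
                   max_link_density: float = 0.33) -> bool:
    """Flag short or link-heavy blocks. Thresholds are illustrative guesses."""
    f = shallow_features(block)
    return f["num_words"] < min_words or f["link_density"] > max_link_density


blocks = [
    TextBlock("Home About Contact", linked_words=3),
    TextBlock("In this paper, we analyze a small set of shallow text features "
              "for classifying the individual text elements in a Web page.",
              linked_words=0),
]
labels = [is_boilerplate(b) for b in blocks]
print(labels)
```

In this sketch the navigation-like block is flagged as boilerplate and the longer sentence is kept. A real classifier would learn its decision rule from labeled data instead of relying on hand-picked thresholds.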

Boilerplate Creation Process | Boilerplate Removal | Boilerplate Text | Data Mining | WSDM 2010 |

Post Info

Added	01 Mar 2010
Updated	02 Mar 2010
Type	Conference
Year	2010
Where	WSDM
Authors	Christian Kohlschütter, Peter Fankhauser, Wolfgang Nejdl
