CETR: content extraction via tag ratios

We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: [In...
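The line-by-line tag ratio computation mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the ratio, so the sketch assumes it is the number of non-tag characters on a line divided by the number of tags on that line, with a tag count of zero treated as one. The function name `tag_ratios` and the regular expression are hypothetical, and the smoothing, two-dimensional modelling and clustering steps are not shown.

```python
import re
from typing import List

# Assumption: anything between '<' and '>' on a single line counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> List[float]:
    """Return one tag ratio per line of the HTML source (hypothetical helper)."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        # Text length is what remains once the tags are stripped out.
        text_len = len(TAG_RE.sub("", line).strip())
        # Assumption: a line without tags is divided by 1, not by 0.
        ratios.append(text_len / max(tag_count, 1))
    return ratios


page = (
    "<div><a href='x'>Home</a></div>\n"
    "<p>This is the main article text.</p>\n"
    "plain"
)
ratios = tag_ratios(page)
```

Under these assumptions, the navigation-style first line, which is mostly markup, gets a low ratio, while the paragraph line gets a much higher one; the resulting per-line values form the histogram that is then clustered into content and non-content areas.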

Tim Weninger, William H. Hsu, Jiawei Han

Content Extraction | Document’s Tag Ratios | Internet Technology | Tag Ratio | WWW 2010

» Improved annotation of the blogosphere via autotagging and hierarchical clustering

» Conceptualization of place via spatial clustering and cooccurrence analysis

» Supporting Webbased Address Extraction with Unsupervised Tagging

» Data Management for XML Research Directions

Post Info

Added	14 May 2010
Updated	14 May 2010
Type	Conference
Year	2010
Where	WWW
Authors	Tim Weninger, William H. Hsu, Jiawei Han
