DEXAW
2008
IEEE

Text Extraction from the Web via Text-to-Tag Ratio

We describe a method for extracting content text from diverse Web pages using the HTML document's Text-to-Tag Ratio rather than specific HTML cues that may not be consistent across Web pages. We describe how to compute the Text-to-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we show surprisingly high levels of recall at all levels of precision, along with large space savings.
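
The line-by-line computation can be illustrated with a minimal Python sketch. The abstract does not spell out the details, so several things here are assumptions: the ratio is taken to be the count of non-tag characters divided by the count of tags on a line, a line with no tags uses a denominator of 1, and a fixed cutoff (the hypothetical `threshold` parameter) stands in for the clustering step. The function names `text_to_tag_ratio` and `split_content` are illustrative only.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <p>, </a>, <a href="/">.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return non-tag characters divided by tag count for one line of HTML."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    # Assumption: a line without tags uses 1 as the denominator,
    # which avoids division by zero.
    return len(text) / max(len(tags), 1)


def split_content(html: str, threshold: float = 10.0):
    """Pair each line's ratio with a content / non-content label.

    Assumption: a fixed threshold replaces the clustering step
    mentioned in the abstract.
    """
    ratios = [text_to_tag_ratio(line) for line in html.splitlines()]
    return [(ratio, ratio >= threshold) for ratio in ratios]


page = """<div><a href="/">Home</a><a href="/about">About</a></div>
<p>This paragraph holds the actual article text of the page.</p>
<div><span>Ads</span></div>"""

labels = split_content(page)
for ratio, is_content in labels:
    print(f"{ratio:6.2f}  {'content' if is_content else 'non-content'}")
```

In this toy page the navigation and advertisement lines carry many tags and little text, so their ratios stay low, while the paragraph line scores far higher. A real page would likely need smoothing and a data-driven split rather than a hand-picked cutoff.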

Tim Weninger, William H. Hsu

Database | DEXAW 2008 | Document’s Text-to-tag Ratio | Text-to-Tag Ratio | Web Pages |

Related Content

» CETR: Content Extraction via Tag Ratios

» RelExt: A Tool for Relation Extraction from Text in Ontology Extension

» Coherent Keyphrase Extraction via Web Mining

» Information Extraction via Path Merging

» Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web

» Hunting for the Black Swan: Risk Mining from Text

» Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora

» Querying for relations from the semistructured Web

» Classification-aware hidden-web text database selection

Post Info

Added	29 May 2010
Updated	29 May 2010
Type	Conference
Year	2008
Where	DEXAW
Authors	Tim Weninger, William H. Hsu