News article extraction with template-independent wrapper


We consider the problem of template-independent news extraction. The state-of-the-art news extraction method is based on template-level wrapper induction, which has two serious limitations. 1) It cannot correctly extract pages belonging to an unseen template until the wrapper for that template has been generated. 2) It is costly to maintain up-to-date wrappers for hundreds of websites, because any change of a template may lead to the invalidation of the corresponding wrapper. In this paper we formalize news extraction as a machine learning problem and learn a template-independent wrapper using a very small number of labeled news pages from a single site. Novel features dedicated to news titles and bodies are developed respectively. Correlations between the news title and the news body are exploited. Our template-independent wrapper can extract news pages from different sites regardless of templates. In experiments, a wrapper is learned from 40 pages from a single news site. It achieve...



Corresponding Wrapper | Internet Technology | Template-independent Wrapper | Template-level Wrapper Induction | WWW 2009


» A Layout-Independent Web News Article Contents Extraction Method Based on Relevance Analysi...

» MetaNews: An Information Agent for Gathering News Articles on the Web

» Extracting article text from the web with maximum subsequence segmentation

Post Info

Added	19 May 2010
Updated	19 May 2010
Type	Conference
Year	2009
Where	WWW
Authors	Junfeng Wang, Xiaofei He, Can Wang, Jian Pei, Jiajun Bu, Chun Chen, Ziyu Guan, Gang Lu
