ECON: An Approach to Extract Content from Web News Page


Abstract: This paper presents a simple but effective approach, named ECON, for fully automatic extraction of content from Web news pages. ECON represents a Web news page as a DOM tree and exploits the features of that tree. ECON first finds a snippet-node, which wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. ECON removes noise during the backtracking. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements of scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also easy to implement.

Keywords: information extraction; Web content extraction; Web mining
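The procedure outlined in the abstract (find a snippet-node, backtrack to a summary-node, drop noise on the way) might be sketched roughly as follows. The abstract does not say how ECON selects the snippet-node, when backtracking stops, or what counts as noise, so every heuristic here is an assumption made for illustration: punctuation counts stand in for "looks like news text", backtracking stops once a parent adds text but no punctuation, and punctuation-free child elements are treated as noise. All names (`Node`, `TreeBuilder`, `extract_content`, `SAMPLE_PAGE`) are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser

PUNCTUATION = set(",.;:!?，。；：！？")   # assumed signal for sentence-like text
SKIPPED_TAGS = {"script", "style"}       # never treated as content
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """Minimal DOM-like node; text is stored in '#text' child nodes."""

    def __init__(self, tag, parent=None, own_text=""):
        self.tag, self.parent, self.own_text = tag, parent, own_text
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def text(self):
        return self.own_text + "".join(c.text() for c in self.children)

    def punctuation_count(self):
        return sum(ch in PUNCTUATION for ch in self.text())

    def nodes(self):
        yield self
        for child in self.children:
            yield from child.nodes()


class TreeBuilder(HTMLParser):
    """Builds a simplified tree; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIPPED_TAGS:
            Node("#text", self.current, data)


def squeeze(text):
    return "".join(text.split())


def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    text_nodes = [n for n in builder.root.nodes() if n.tag == "#text"]
    if not text_nodes:
        return ""
    # Assumed snippet-node: the element wrapping the most punctuated text run.
    node = max(text_nodes, key=Node.punctuation_count).parent
    # Backtrack while the parent adds sentence-like text or is a pure wrapper.
    while node.parent is not None:
        parent = node.parent
        adds_sentences = parent.punctuation_count() > node.punctuation_count()
        pure_wrapper = squeeze(parent.text()) == squeeze(node.text())
        if not (adds_sentences or pure_wrapper):
            break
        node = parent
    # Assumed noise rule: drop child elements that contain no punctuation.
    kept = [c.text() for c in node.children
            if c.tag == "#text" or c.punctuation_count() > 0]
    return " ".join(" ".join(kept).split())


SAMPLE_PAGE = """<html><head><title>Site</title>
<script>var a = "x, y, z.";</script></head>
<body>
<div id="nav"><a href="/">Home</a> <a href="/w">World</a></div>
<div id="story">
<p>The first paragraph, which is long, has commas.</p>
<div class="share"><a href="#">Share</a> <a href="#">Print</a></div>
<p>The second paragraph follows.</p>
</div>
<div id="footer">Copyright Example</div>
</body></html>"""

if __name__ == "__main__":
    print(extract_content(SAMPLE_PAGE))
```

On the made-up `SAMPLE_PAGE`, the sketch starts at the first `<p>`, climbs to the `story` block, and stops there because the `<body>` adds only navigation and footer text. A real page whose navigation or footer contains punctuation would defeat these simple rules, so this should be read as an illustration of the snippet-node and summary-node idea, not as the authors' method.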

Yan Guo, Huifeng Tang, Linhai Song, Yu Wang 0009, Guodong Ding


APWEB 2010 | DOM Tree | ECON | Effective Approach | Internet Technology


Post Info

Added	10 Feb 2011
Updated	10 Feb 2011
Type	Journal
Year	2010
Where	APWEB
Authors	Yan Guo, Huifeng Tang, Linhai Song, Yu Wang 0009, Guodong Ding
