Sciweavers

WWW
2009
ACM

Extracting article text from the web with maximum subsequence segmentation

15 years 1 months ago
Extracting article text from the web with maximum subsequence segmentation
Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are...
Jeff Pasternack, Dan Roth
Added 21 Nov 2009
Updated 21 Nov 2009
Type Conference
Year 2009
Where WWW
Authors Jeff Pasternack, Dan Roth
Comments (0)