Extracting article text from the web with maximum subsequence segmentation

Much of the information on the Web is found in articles from online news outlets, magazines, encyclopedias, review collections, and other sources. However, extracting this content from the original HTML document is complicated by the large amount of less informative and typically unrelated material such as navigation menus, forms, user comments, and ads. Existing approaches tend to be either brittle and demand significant expert knowledge and time (manual or tool-assisted generation of rules or code), necessitate labeled examples for every different page structure to be processed (wrapper induction), require relatively uniform layout (template detection), or, as with Visual Page Segmentation (VIPS), are computationally expensive. We introduce maximum subsequence segmentation, a method of global optimization over token-level local classifiers, and apply it to the domain of news websites. Training examples are easy to obtain, both learning and prediction are linear time, and results are...
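The abstract's central idea, a global optimization over token-level local classifiers that runs in linear time, can be illustrated with a small sketch. It is not the authors' implementation. It assumes a local classifier has already given each token a probability of being article text, shifts those probabilities by a hypothetical threshold of 0.5 so that likely-article tokens score positive, and then finds the contiguous run of tokens with the largest total score in a single pass (Kadane's algorithm). The function names, the threshold, and the sample data are all assumptions made for illustration.

```python
from typing import List, Tuple


def max_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive. If no run has a positive sum, the empty run
    (0, 0, 0.0) is returned. Runs in one pass over `scores`.
    """
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, score in enumerate(scores):
        # A running total that is not positive cannot help any later run,
        # so start a fresh candidate run at this token.
        if cur_sum <= 0:
            cur_start, cur_sum = i, 0.0
        cur_sum += score
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


def extract_article(tokens: List[str], probs: List[float],
                    threshold: float = 0.5) -> List[str]:
    """Pick the token span maximizing the sum of (probability - threshold).

    `probs` stands in for the output of a token-level classifier; the
    threshold of 0.5 is an assumed value, not one taken from the paper.
    """
    scores = [p - threshold for p in probs]
    start, end, _ = max_subsequence(scores)
    return tokens[start:end]


if __name__ == "__main__":
    # Made-up tokens and probabilities: navigation, article text, then ads.
    tokens = ["Home", "Login", "The", "mayor", "said", "today", "Share", "Ads"]
    probs = [0.1, 0.2, 0.9, 0.8, 0.4, 0.9, 0.2, 0.1]
    print(" ".join(extract_article(tokens, probs)))
```

Note that the token "said" is kept even though its own probability is below the threshold: the surrounding high-scoring tokens outweigh it, which is the sense in which the segmentation is chosen globally rather than token by token.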

Jeff Pasternack, Dan Roth

Internet Technology | Maximum Subsequence Segmentation | Necessitate Labeled Examples | Visual Page Segmentation | WWW 2009

Post Info

Added	21 Nov 2009
Updated	21 Nov 2009
Type	Conference
Year	2009
Where	WWW
Authors	Jeff Pasternack, Dan Roth
