Motivated by emerging needs in online interactions, we define a new type of digital signature called a `Content Extraction Signature' (CES). A CES allows the owner, Bob, of a...
The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Conte...
We describe a methodology for retrieving document images from large extremely diverse collections. First we perform content extraction, that is the location and measurement of reg...
Previous content extraction evaluations have neglected to address problems which complicate the incorporation of extracted information into an existing knowledge base. Previous qu...
Paul McNamee, Hoa Trang Dang, Heather Simpson, Pat...
We consider the problem of content extraction from online news webpages. To explore to what extent the syntactic markup and the visual structure of a webpage facilitate the extrac...
We present Content Extraction via Tag Ratios (CETR) – a method to extract content text from diverse webpages by using the HTML document’s tag ratios. We describe how to comput...
Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. Extraction o...
Suhit Gupta, Gail E. Kaiser, David Neistadt, Peter...
Web pages contain clutter (such as ads, unnecessary images and extraneous links) around the body of an article, which distracts a user from actual content. Extraction of "use...