Sciweavers

708 search results - page 10 / 142
» Identifying Content Blocks from Web Documents
ACL
2006
13 years 10 months ago
Examining the Content Load of Part of Speech Blocks for Information Retrieval
We investigate the connection between part of speech (POS) distribution and content in language. We define POS blocks to be groups of parts of speech. We hypothesise that there ex...
Christina Lioma, Iadh Ounis
WWW
2003
ACM
14 years 9 months ago
DOM-based content extraction of HTML documents
Web pages often contain clutter (such as pop-up ads, unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. Extraction o...
Suhit Gupta, Gail E. Kaiser, David Neistadt, Peter...
JUCS
2008
13 years 8 months ago
Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction
As web sites are getting more complicated, the construction of web information extraction systems becomes more troublesome and time-consuming. A common theme is the diffi...
Jinbeom Kang, Joongmin Choi
ICDAR
2009
IEEE
14 years 3 months ago
PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents
This paper presents PDF-TREX, a heuristic approach for table recognition and extraction from PDF documents. The heuristic starts from an initial set of basic content elements an...
Ermelinda Oro, Massimo Ruffolo
WWW
2004
ACM
14 years 9 months ago
Web page summarization using dynamic content
Summarizing web pages has recently gained much attention from researchers. Until now, two main types of approaches have been proposed for this task: content- and context-based met...
Adam Jatowt