Sciweavers


CIKM
2003
Springer


Extracting unstructured data from template generated web documents


Download www.ir.iit.edu

We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases the retrieval precision for queries that generate irrelevant results. We believe that reducing the number of irrelevant results encourages users to return to a given site to search. Our experimental results on several different web sites and on the whole cnnfn collection demonstrate the feasibility of our approach.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Search Process

General Terms: Design, Experimentation

Keywords: automatic template removal, text extraction, information retrieval, retrieval accuracy
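The abstract does not spell out how templates are identified, so the following is only a minimal sketch of one common heuristic, not the authors' method: text lines that recur on most pages of a site are treated as template, and whatever remains is kept as the page body. The function names, the `min_fraction` threshold, and the sample pages are all hypothetical and chosen for illustration.

```python
from collections import Counter


def find_template_lines(pages, min_fraction=0.8):
    """Return text lines that occur on at least min_fraction of the pages.

    Assumption: template text repeats verbatim across pages of a site.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_template(page, template_lines):
    """Keep only the non-empty lines that are not part of the template."""
    kept = [
        line.strip()
        for line in page.splitlines()
        if line.strip() and line.strip() not in template_lines
    ]
    return "\n".join(kept)


# Hypothetical pages sharing a navigation bar and a footer.
pages = [
    "Home | Markets | News\nStocks rallied on Tuesday.\nCopyright Example Site",
    "Home | Markets | News\nOil prices fell sharply.\nCopyright Example Site",
    "Home | Markets | News\nThe dollar was little changed.\nCopyright Example Site",
]

template = find_template_lines(pages)
bodies = [strip_template(page, template) for page in pages]
```

Indexing only `bodies` rather than the full pages keeps navigation and footer terms out of the index, which is the effect on retrieval precision that the abstract describes.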

Ling Ma, Nazli Goharian, Abdur Chowdhury, Misun Chung



Post Info

Added	06 Jul 2010
Updated	06 Jul 2010
Type	Conference
Year	2003
Where	CIKM
Authors	Ling Ma, Nazli Goharian, Abdur Chowdhury, Misun Chung
