CIKM
2003
Springer

Extracting unstructured data from template generated web documents

We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases retrieval precision for queries that generate irrelevant results. We believe that reducing the number of irrelevant results encourages users to return to a given site to search. Our experimental results on several different web sites and on the whole cnnfn collection demonstrate the feasibility of our approach.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Search Process
General Terms: Design, Experimentation
Keywords: Automatic template removal, text extraction, information retrieval, retrieval accuracy
Added 06 Jul 2010
Updated 06 Jul 2010
Type Conference
Year 2003
Where CIKM
Authors Ling Ma, Nazli Goharian, Abdur Chowdhury, Misun Chung