

Automatic web news extraction using tree edit distance

The Web presents itself as the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed for the problem of Web data extraction, their use is still not widespread, mostly because of the need for heavy human intervention and the low quality of the extraction results. In this paper, we present a domain-oriented approach to Web data extraction and discuss its application to automatically extracting news from Web sites. Our approach is based on a highly efficient tree structure analysis that produces very effective results. We have tested our approach with several important Brazilian on-line news sites and achieved very precise results, correctly extracting 87.71% of the news in a set of 4088 pages distributed among 35 different sites.
Categories and Subject Descriptors H.3.m [Information Storage and Retrieval]: Mis...
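The abstract does not spell out the tree structure analysis, so the following is only a generic illustration of what a tree edit distance measures, not the authors' algorithm. It is a minimal sketch of the classic top-down (Selkow-style) variant with unit costs, in which whole subtrees are inserted or deleted and node labels can be changed. The `Node` class and the function names are hypothetical, chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A labeled ordered tree node, e.g. one HTML element (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def size(tree: Node) -> int:
    """Number of nodes in the tree; used as the cost of inserting or deleting it."""
    return 1 + sum(size(child) for child in tree.children)


def top_down_distance(a: Node, b: Node) -> int:
    """Top-down tree edit distance with unit costs (illustrative assumption).

    Roots are always mapped to each other; the child sequences are then
    aligned with a string-edit-style dynamic program in which deleting or
    inserting a child costs the size of its whole subtree.
    """
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # d[i][j] = cost of turning the first i children of a into the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),       # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),       # insert subtree
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


# Two small page skeletons that differ by one extra paragraph.
page_a = Node("div", [Node("h1"), Node("p")])
page_b = Node("div", [Node("h1"), Node("p"), Node("p")])
print(top_down_distance(page_a, page_b))
```

Under this kind of measure, pages generated from the same template tend to sit at a small distance from each other, which is the intuition behind comparing page trees when looking for shared structure.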
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2004
Where WWW
Authors Davi de Castro Reis, Paulo Braz Golgher, Altigran Soares da Silva, Alberto H. F. Laender