SAC
2005
ACM

Automatic extraction of informative blocks from webpages

Search engines crawl and index webpages depending upon their informative content. However, webpages, especially dynamically generated ones, contain items that cannot be classified as the “primary content”, e.g., navigation sidebars, advertisements, and copyright notices. Most end-users search for the primary content and largely do not seek the non-informative content. A tool that assists an end-user or application to search and process information from webpages automatically must separate the “primary content blocks” from the other blocks. In this paper, two new algorithms, ContentExtractor and FeatureExtractor, are proposed. The algorithms identify primary content blocks by i) looking for blocks that do not occur a large number of times across webpages and ii) looking for blocks with desired features, respectively. They identify the primary content blocks with high precision and recall, reduce the storage requirement for search engines, result in smaller indexes and ...
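The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' ContentExtractor or FeatureExtractor; it only mirrors the stated intuition under simplifying assumptions: pages are already segmented into text blocks, two blocks count as the same only if their normalized text matches exactly, and the single "desired feature" is a minimum word count. The names `rare_blocks`, `blocks_with_feature`, `max_fraction`, and `min_words` are hypothetical.

```python
from collections import Counter


def normalize(block: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(block.lower().split())


def rare_blocks(pages, max_fraction=0.5):
    """Idea (i): keep blocks that do not recur across many pages.

    pages: list of pages, each a list of block strings (assumed pre-segmented).
    A block is kept if it appears on at most max_fraction of the pages.
    """
    doc_freq = Counter()
    for blocks in pages:
        # Count each distinct block once per page (document frequency).
        doc_freq.update({normalize(b) for b in blocks})
    limit = max_fraction * len(pages)
    return [
        [b for b in blocks if doc_freq[normalize(b)] <= limit]
        for blocks in pages
    ]


def blocks_with_feature(blocks, min_words=20):
    """Idea (ii): keep blocks showing a desired feature.

    Here the feature is simply "contains at least min_words words",
    an assumption chosen for illustration only.
    """
    return [b for b in blocks if len(b.split()) >= min_words]


pages = [
    ["Home | About", "Story one text", "(c) 2005"],
    ["Home | About", "Story two text", "(c) 2005"],
    ["Home | About", "Story three", "(c) 2005"],
]
primary = rare_blocks(pages)
```

In this toy input the navigation bar and copyright notice appear on every page and are dropped, while each page's unique story block is kept. A realistic system would need a block segmentation step and a similarity measure more tolerant than exact matching.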
Sandip Debnath, Prasenjit Mitra, C. Lee Giles
Added 26 Jun 2010
Updated 26 Jun 2010
Type Conference
Year 2005
Where SAC
Authors Sandip Debnath, Prasenjit Mitra, C. Lee Giles