This work presents a novel, site-specific algorithm for web page segmentation and section importance detection that leverages structural, content, and visual information. The structural and content information is captured by a template, a generalized regular expression learned over a set of pages. Combining the template with visual information yields high sectioning accuracy. Experimental results demonstrate the effectiveness of the approach.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Extraction

General Terms: Algorithms, Design
Rupesh R. Mehta, Amit Madaan