Sciweavers

JCDL
2006
ACM

Combining DOM tree and geometric layout analysis for online medical journal article segmentation

We describe an HTML web page segmentation algorithm and apply it to online medical journal articles, both regular HTML and PDF-converted HTML files. The web page content is modeled as a zone tree based primarily on the geometric layout of the page. For a given journal article, the zone tree is generated by combining DOM tree analysis with the recursive X-Y cut algorithm. Using additional visual cues, such as background color, font size, and font color, the page is then segmented into homogeneous regions. The algorithm is evaluated on 104 articles from 11 journals: of 9726 ground-truth zones, 9376 are correctly segmented, an accuracy of 96.40%. Segmenting the entire web page into zones can significantly speed up the subsequent information retrieval steps and improve their accuracy.

Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – Indexing methods; H.3.6 [Information Storage and Retrieval]...
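The recursive X-Y cut named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the geometric part, with no DOM tree analysis or visual cues, and it assumes the rendered elements are already available as axis-aligned bounding boxes. The names `Box`, `find_cuts`, `xy_cut`, and the `min_gap` threshold are hypothetical choices made for this illustration.

```python
from typing import List, Tuple

# Assumed input format: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
Box = Tuple[float, float, float, float]


def find_cuts(intervals: List[Tuple[float, float]], min_gap: float) -> List[float]:
    """Return the midpoint of every empty gap of at least min_gap
    between the projected intervals."""
    ordered = sorted(intervals)
    cuts = []
    covered_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start - covered_end >= min_gap:
            cuts.append((covered_end + start) / 2)
        covered_end = max(covered_end, end)
    return cuts


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[List[Box]]:
    """Recursively split boxes into zones along whitespace gaps.

    Horizontal cuts (gaps in the y projection) are tried first, then
    vertical cuts (gaps in the x projection). A group with no gap of
    at least min_gap on either axis becomes one zone.
    """
    if min_gap <= 0:
        raise ValueError("min_gap must be positive")
    if not boxes:
        return []
    if len(boxes) == 1:
        return [boxes]
    for axis in (1, 0):  # 1 = y projection, 0 = x projection
        cuts = find_cuts([(b[axis], b[axis + 2]) for b in boxes], min_gap)
        if cuts:
            groups: List[List[Box]] = [[] for _ in range(len(cuts) + 1)]
            for b in boxes:
                # A box belongs to the group after the last cut it starts beyond.
                groups[sum(b[axis] > c for c in cuts)].append(b)
            zones: List[List[Box]] = []
            for group in groups:
                zones.extend(xy_cut(group, min_gap))
            return zones
    return [boxes]
```

For a page with a full-width header above two columns, the first pass would cut between the header and the body, and the second pass would cut between the columns. The choice of `min_gap` decides how fine the resulting zones are; the listing does not state what thresholds the paper uses.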
Jie Zou, Daniel X. Le, George R. Thoma
Added 14 Jun 2010
Updated 14 Jun 2010
Type Conference
Year 2006
Where JCDL
Authors Jie Zou, Daniel X. Le, George R. Thoma