Sciweavers

JCDL
2005
ACM

Automatic extraction of titles from general documents using machine learning

In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previous methods have mainly addressed title extraction from research papers, and it has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively) and take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information such as font size as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from g...
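The pipeline described in the abstract (annotate titles, train a model on formatting features, extract with the trained model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Unit` record, the particular features, the toy documents, and the choice of a simple perceptron are hypothetical and are not taken from the paper, whose abstract does not name the models or the full feature set.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Unit:
    """One text unit (e.g. a line) with hypothetical formatting attributes."""
    text: str
    font_size: float
    bold: bool
    centered: bool
    index: int  # position within the document, 0 = first unit


def features(unit: Unit, doc: List[Unit]) -> List[float]:
    """Map a unit to formatting features (an illustrative choice, not the paper's)."""
    max_size = max(u.font_size for u in doc)
    return [
        1.0,                             # bias term
        unit.font_size / max_size,       # font size relative to the largest in the document
        1.0 if unit.bold else 0.0,
        1.0 if unit.centered else 0.0,
        1.0 / (1 + unit.index),          # earlier units get a larger value
    ]


def score(w: List[float], x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(docs: List[Tuple[List[Unit], int]], epochs: int = 20) -> List[float]:
    """Train on (document, annotated title index) pairs with the classic perceptron rule."""
    w = [0.0] * 5
    for _ in range(epochs):
        for doc, title_idx in docs:
            for u in doc:
                x = features(u, doc)
                y = 1 if u.index == title_idx else -1
                if y * score(w, x) <= 0:  # misclassified: move weights toward the label
                    w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def extract_title(w: List[float], doc: List[Unit]) -> str:
    """Return the text of the highest-scoring unit as the predicted title."""
    return max(doc, key=lambda u: score(w, features(u, doc))).text


# Toy annotated documents (invented for this sketch).
doc_a = [
    Unit("ACME Corp", 10, False, False, 0),
    Unit("Annual Report 2004", 24, True, True, 1),
    Unit("This report summarizes the year.", 11, False, False, 2),
]
doc_b = [
    Unit("Project Kickoff", 32, True, True, 0),
    Unit("Agenda and goals", 18, False, False, 1),
    Unit("Prepared by the team", 12, False, True, 2),
]
weights = train_perceptron([(doc_a, 1), (doc_b, 0)])

# An unseen document: the large, bold, centered unit should score highest.
doc_c = [
    Unit("Dear colleagues,", 11, False, False, 0),
    Unit("Quarterly Sales Brochure", 20, True, True, 1),
    Unit("Contact us", 11, True, False, 2),
]
predicted = extract_title(weights, doc_c)
```

In this sketch the model sees only formatting and position, never the words themselves, which mirrors the abstract's emphasis on formatting information such as font size. A realistic system would need far more annotated documents and would likely train separate models for Word and PowerPoint, as the abstract suggests.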
Added: 26 Jun 2010
Updated: 26 Jun 2010
Type: Conference
Year: 2005
Where: JCDL
Authors: Yunhua Hu, Hang Li, Yunbo Cao, Dmitriy Meyerzon, Qinghua Zheng