Sciweavers

ICDAR
2003
IEEE

Extraction, layout analysis and classification of diagrams in PDF documents

Diagrams are a critical part of virtually all scientific and technical documents. Analyzing diagrams will be important for building comprehensive document retrieval systems. This paper focuses on the extraction and classification of diagrams from PDF documents. We study diagrams available in vector (not raster) format in online research papers. PDF files are parsed and their vector graphics components installed in a spatial index. Subdiagrams are found by analyzing white space gaps. A set of statistics is generated for each diagram, e.g., the number of horizontal lines and vertical lines. The statistics form a feature vector description of the diagram. The vectors are used in a kernel-based machine learning system (Support Vector Machine). Separating a set of bar graphs from non-bar-graphs gathered from 20,000 biology research papers gave a classification accuracy of
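As a rough illustration of two steps the abstract mentions (finding subdiagrams from white space gaps, and counting horizontal and vertical lines to form a feature vector), here is a minimal Python sketch. It is not the authors' implementation: the segment representation, the function names `split_on_x_gaps` and `line_features`, the `min_gap` and `tol` parameters, and the restriction to gaps along the x-axis are all assumptions made for illustration. The spatial index and the Support Vector Machine stage are not modeled.

```python
from typing import List, Tuple

# Hypothetical representation: a line segment as (x0, y0, x1, y1).
Segment = Tuple[float, float, float, float]


def split_on_x_gaps(segments: List[Segment], min_gap: float) -> List[List[Segment]]:
    """Group segments into candidate subdiagrams.

    A new group starts whenever the horizontal white space between the
    right edge of the current group and the next segment exceeds min_gap.
    (Assumption: only gaps along the x-axis are considered.)
    """
    groups: List[List[Segment]] = []
    right_edge = None
    for seg in sorted(segments, key=lambda s: min(s[0], s[2])):
        left, right = min(seg[0], seg[2]), max(seg[0], seg[2])
        if right_edge is None or left - right_edge > min_gap:
            groups.append([seg])
            right_edge = right
        else:
            groups[-1].append(seg)
            right_edge = max(right_edge, right)
    return groups


def line_features(segments: List[Segment], tol: float = 1e-6) -> List[int]:
    """Return [n_horizontal, n_vertical, n_other] for a set of segments."""
    n_h = n_v = n_other = 0
    for x0, y0, x1, y1 in segments:
        dx, dy = abs(x1 - x0), abs(y1 - y0)
        if dy <= tol and dx > tol:
            n_h += 1
        elif dx <= tol and dy > tol:
            n_v += 1
        else:
            n_other += 1
    return [n_h, n_v, n_other]


# Two made-up clusters of segments separated by a wide horizontal gap.
segments = [
    (0, 0, 10, 0), (0, 0, 0, 10), (2, 0, 2, 5),
    (30, 0, 40, 0), (30, 0, 30, 8), (31, 1, 35, 6),
]
subdiagrams = split_on_x_gaps(segments, min_gap=5)
feature_vectors = [line_features(group) for group in subdiagrams]
```

Each resulting feature vector could then be passed to a classifier; which additional statistics the paper uses beyond line counts is not stated in this abstract.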
Added 04 Jul 2010
Updated 04 Jul 2010
Type Conference
Year 2003
Where ICDAR
Authors Robert P. Futrelle, Mingyan Shao, Chris Cieslik, Andrea Elaina Grimes