Diagrams are a critical part of virtually all scientific and technical documents, so analyzing them will be important for building comprehensive document retrieval systems. This paper focuses on the extraction and classification of diagrams from PDF documents. We study diagrams available in vector (not raster) format in online research papers. PDF files are parsed and their vector graphics components are inserted into a spatial index. Subdiagrams are found by analyzing white space gaps. A set of statistics is generated for each diagram, e.g., the number of horizontal lines and vertical lines. These statistics form a feature vector description of the diagram. The feature vectors serve as input to a kernel-based machine learning system (a Support Vector Machine). Separating a set of bar graphs from non-bar-graphs gathered from 20,000 biology research papers gave a classification accuracy of
Robert P. Futrelle, Mingyan Shao, Chris Cieslik, A
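
As an illustration of the feature-vector step described in the abstract, the following is a minimal sketch that counts horizontal and vertical line segments in one diagram. It is not the authors' implementation: the segment representation `(x0, y0, x1, y1)`, the function name `line_features`, the tolerance `tol`, and the toy data are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: one line segment as (x0, y0, x1, y1).
Segment = Tuple[float, float, float, float]


def line_features(segments: List[Segment], tol: float = 0.5) -> List[int]:
    """Return [n_horizontal, n_vertical, n_other] for a diagram's segments.

    A segment counts as horizontal if its vertical extent is within `tol`
    and its horizontal extent exceeds `tol`; vertical is the mirror case.
    The tolerance value is an assumption, not taken from the paper.
    """
    horizontal = vertical = other = 0
    for x0, y0, x1, y1 in segments:
        dx, dy = abs(x1 - x0), abs(y1 - y0)
        if dy <= tol and dx > tol:
            horizontal += 1
        elif dx <= tol and dy > tol:
            vertical += 1
        else:
            other += 1
    return [horizontal, vertical, other]


# Toy diagram (made up): two axes, three vertical bar edges, one diagonal.
toy_segments = [
    (0, 0, 100, 0),   # x-axis
    (0, 0, 0, 80),    # y-axis
    (10, 0, 10, 40),  # bar edge
    (30, 0, 30, 60),  # bar edge
    (50, 0, 50, 20),  # bar edge
    (0, 0, 100, 80),  # diagonal
]
features = line_features(toy_segments)
```

A vector like `features` would be one row of the input to a classifier such as a Support Vector Machine; the actual set of statistics used in the paper is larger than the three counts shown here.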