From the standpoint of automated extraction of scientific knowledge, an important but little-studied part of scientific publications is the figures and their accompanying captions. Captions are dense in information but also contain many extra-grammatical constructs, making them awkward to process with standard information extraction methods. We propose a scheme for "understanding" captions in biomedical publications by extracting and classifying "image pointers" (references to the accompanying image). We evaluate a number of automated methods for this task, including hand-coded methods, methods based on existing learning techniques, and methods based on novel learning techniques. The best of these methods yields a usefully accurate tool for caption understanding, with both recall and precision in excess of 94% on the most important single class in a combined extraction/classification task.

General Terms: Information extraction

Keywords: Information extraction, bioi...
William W. Cohen, Richard C. Wang, Robert F. Murph
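
As a rough illustration of what a hand-coded extractor for image pointers might look like, the sketch below finds parenthesized panel labels in a caption and scores the result with precision and recall. The regular expression, the function names `extract_image_pointers` and `precision_recall`, and the sample caption are all assumptions made for illustration; they are not the methods or data evaluated in the paper, and the sketch does not attempt the classification step.

```python
import re

# Hypothetical pattern (an assumption, not the paper's rule set):
# parenthesized panel labels such as "(A)", "(B, C)" or "(D-F)".
POINTER_RE = re.compile(r"\(([A-Za-z](?:\s*[-,]\s*[A-Za-z])*)\)")


def extract_image_pointers(caption):
    """Return (start, end, text) for each candidate image pointer."""
    return [(m.start(), m.end(), m.group(0)) for m in POINTER_RE.finditer(caption)]


def precision_recall(predicted, gold):
    """Compute precision and recall over two collections of pointer strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up caption for illustration only.
    caption = "Localization of GFP (A) and DAPI staining (B, C). Scale bar (10 um)."
    found = [text for _, _, text in extract_image_pointers(caption)]
    print(found)
    print(precision_recall(found, ["(A)", "(B, C)"]))
```

A pattern this simple would miss pointers that are not parenthesized and would accept single-letter parentheticals that are not panel labels, which is one reason a learned extractor may be preferable to hand-coded rules.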