There is extensive interest in automating the collection, organization, and summarization of biological data. Data in the form of figures and accompanying captions in the literature present special challenges for such efforts. Building on our previously developed search engines for finding fluorescence microscope images that depict protein subcellular patterns, we have introduced text mining and Optical Character Recognition (OCR) techniques for caption understanding and figure-text matching, with the goal of building a robust, comprehensive toolset for extracting information about protein subcellular localization from the text and images found in online journals. Our current system can generate assertions such as "Figure N depicts a localization of type L for protein P in cell type C".

Keywords: Information extraction, bioinformatics, text mining, image mining, fluorescence microscopy, protein localization
Zhenzhen Kou, William W. Cohen, Robert F. Murphy
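
The assertion template quoted in the abstract can be pictured as a small record with four fields. The following is only an illustrative sketch, not the system's actual representation: the class name `LocalizationAssertion`, its field names, and the `to_sentence` method are all hypothetical, and the placeholder values N, L, P, and C are taken directly from the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalizationAssertion:
    """Hypothetical record for one extracted assertion (names are assumptions)."""
    figure: str        # figure identifier, "N" in the template
    localization: str  # subcellular localization type, "L" in the template
    protein: str       # protein name, "P" in the template
    cell_type: str     # cell type, "C" in the template

    def to_sentence(self) -> str:
        # Render the record using the sentence template from the abstract.
        return (
            f"Figure {self.figure} depicts a localization of type "
            f"{self.localization} for protein {self.protein} "
            f"in cell type {self.cell_type}"
        )


# Fill the record with the template's own placeholders.
template = LocalizationAssertion(figure="N", localization="L",
                                 protein="P", cell_type="C")
print(template.to_sentence())
```

Keeping the four slots as separate fields, rather than storing only the rendered sentence, would let assertions be filtered or grouped by protein or cell type; whether the real system does so is not stated here.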