An OCR system chosen for its high recognition rate and low percentage of false positives also assigns low confidence values to many characters that are actually correct. Human operators must verify every word containing a low-confidence character. We describe the creation of a lexicon optimized for automatically and selectively resetting confidence values to high, thereby reducing operator verification time. Two word lists, OCR Correct and OCR Incorrect, were extracted from files that had already been processed and verified, and became the standard for comparing candidate lexicons. A lexicon was selected from several candidate word lists maintained by the National Library of Medicine (NLM). In operation for about six months, lexicon-assisted verification has reduced the number of words requiring operator verification by over 50%.

Background

The Lister Hill National Center for Biomedical Communications, a Research and Development Division of NLM, is developing a system [1] for semi-automated entry of journal ar...
Susan E. Hauser, A. C. Browne, George R. Thoma, Al
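
The lexicon-assisted verification summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence scale (`HIGH`, `LOW_THRESHOLD`), the `OcrWord` structure, and the scoring rule in `score_lexicon` are assumptions made for illustration. The sketch resets the confidence of a flagged word when the word appears in the lexicon, and it rates a candidate lexicon by how many words from the OCR Correct and OCR Incorrect lists it would accept.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

HIGH = 9           # hypothetical top of the confidence scale
LOW_THRESHOLD = 7  # hypothetical cutoff below which a character is "low confidence"


@dataclass
class OcrWord:
    text: str
    confidences: List[int]  # one value per character (assumed layout)


def needs_verification(word: OcrWord) -> bool:
    """A word goes to the operator if any character is low confidence."""
    return any(c < LOW_THRESHOLD for c in word.confidences)


def apply_lexicon(words: Iterable[OcrWord], lexicon: Set[str]) -> List[OcrWord]:
    """Reset confidences to HIGH for flagged words that the lexicon contains."""
    result = []
    for word in words:
        if needs_verification(word) and word.text.lower() in lexicon:
            result.append(OcrWord(word.text, [HIGH] * len(word.confidences)))
        else:
            result.append(word)
    return result


def score_lexicon(lexicon: Set[str],
                  ocr_correct: List[str],
                  ocr_incorrect: List[str]) -> Tuple[float, float]:
    """Return (fraction of correct words accepted, fraction of incorrect words accepted).

    A useful lexicon accepts many OCR Correct words and few OCR Incorrect ones.
    """
    accepted_correct = sum(w.lower() in lexicon for w in ocr_correct)
    accepted_incorrect = sum(w.lower() in lexicon for w in ocr_incorrect)
    return (accepted_correct / len(ocr_correct),
            accepted_incorrect / len(ocr_incorrect))


# Toy data, invented for this example.
lexicon = {"protein", "cell"}
words = [
    OcrWord("protein", [9, 3, 9, 9, 9, 9, 9]),  # correct word, one low character
    OcrWord("protcin", [9, 9, 9, 9, 2, 9, 9]),  # OCR error, stays flagged
    OcrWord("cell", [9, 9, 9, 9]),              # already high confidence
]
cleaned = apply_lexicon(words, lexicon)
flagged_before = sum(needs_verification(w) for w in words)
flagged_after = sum(needs_verification(w) for w in cleaned)
```

In this toy run the lexicon clears "protein" but leaves the misrecognized "protcin" for the operator, which is the selective behavior the abstract describes. Comparing the two fractions returned by `score_lexicon` across candidate word lists is one plausible way to use the OCR Correct and OCR Incorrect lists as a selection standard.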