Abstract—Whole-book recognition is a document image analysis strategy that operates on the complete set of a book’s page images, using automatic adaptation to improve accuracy. Our algorithm expects to be given approximate iconic and linguistic models, derived from (generally errorful) OCR results and (generally incomplete) dictionaries, and then, guided entirely by evidence internal to the test set, corrects the models, yielding improved accuracy. The iconic model describes image formation and determines the behavior of a character-image classifier. The linguistic model describes word-occurrence probabilities. In previous work, we reported that adapting the iconic model alone (with a perfect linguistic model) automatically reduced the word error rate on a 180-page book by a large factor. In this paper, we propose an algorithm that alternately adapts the iconic model and the linguistic model, improving both on the fly. The linguistic model adaptation method, wh...
Pingping Xiu, Henry S. Baird