A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books



A number of projects are creating searchable digital libraries of printed books. These include the Million Book Project, the Google Book project and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine-readable (e.g. ASCII) text using an optical character recognition (OCR) engine and then doing full-text search on the results. Many of these books are old, and a variety of processing steps are required to create an end-to-end system. Changing any step (including the scanning process) can affect OCR performance, and hence a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two lar...
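As a rough illustration of what aligning ground truth with OCR output involves, the sketch below aligns two word sequences and reports the fraction of ground-truth words recovered. It is not the hierarchical HMM method of the paper: it substitutes Python's standard-library `difflib.SequenceMatcher` as a simple stand-in aligner, and the function name `word_accuracy` and the sample strings are hypothetical, chosen only for this example.

```python
from difflib import SequenceMatcher


def word_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Fraction of ground-truth words matched, in order, by the OCR output.

    Assumption: whitespace tokenization and exact word matching; a real
    evaluation would also need to handle hyphenation, headers, and so on.
    """
    truth_words = ground_truth.split()
    ocr_words = ocr_output.split()
    if not truth_words:
        # No ground truth: perfect only if the OCR output is also empty.
        return 1.0 if not ocr_words else 0.0
    # autojunk=False disables the popularity heuristic, which would
    # otherwise ignore very frequent words in long texts.
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


if __name__ == "__main__":
    truth = "it was the best of times it was the worst of times"
    ocr = "it was tbe best of times it was the worst of tirnes"
    print(f"word accuracy: {word_accuracy(truth, ocr):.2f}")
```

A general-purpose sequence matcher like this is convenient for short passages; how well it scales to the book-length texts the abstract describes is not something this sketch addresses.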

Shaolei Feng, R. Manmatha


Book Project | JCDL 2006 | OCR Output | OCR Performance


» Transferring structural markup across translations using multilingual alignment and projec...

» Self Adaptable Recognizer for Document Image Collections

» Analysis of wholebook recognition

Post Info

Added	14 Jun 2010
Updated	14 Jun 2010
Type	Conference
Year	2006
Where	JCDL
Authors	Shaolei Feng, R. Manmatha
