Improving State-of-the-Art OCR through High-Precision Document-Specific Modeling

Optical character recognition (OCR) remains a difficult problem for noisy documents or documents not scanned at high resolution. Many current approaches rely on stored font models, which are vulnerable when the document is noisy or is written in a font dissimilar to the stored fonts. We address these problems by learning character models directly from the document itself, rather than using pre-stored font models. This approach has had some success in the past, but we achieve a substantial improvement in error reduction through a novel method for creating nearly error-free document-specific training data and building character appearance models from this data. In particular, we first use the state-of-the-art OCR system Tesseract to produce an initial translation. Then, our method identifies a subset of words that it is highly confident have been recognized correctly and uses this subset to bootstrap document-specific character models. We present theoretical justific...
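The bootstrapping idea in the abstract (take an initial OCR pass, keep only the words believed to be correct, learn character models from them, then re-read the rest) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's method: the abstract is truncated before it says how high-confidence words are identified, so a simple score threshold stands in for that step; the `conf` field, the `glyphs` feature vectors, the mean-template model, and all function names are hypothetical; and Tesseract itself is not called, since its output is represented by plain dictionaries.

```python
def select_confident_words(ocr_words, threshold=0.95):
    """Keep words whose (hypothetical) confidence score meets the threshold.

    Stand-in for the paper's high-precision selection step, whose actual
    criterion is not given in the abstract.
    """
    return [w for w in ocr_words if w["conf"] >= threshold]


def build_char_models(confident_words):
    """Average the glyph vectors seen for each character label.

    Returns a dict mapping a character to its mean feature vector,
    i.e. a document-specific template learned from this document only.
    """
    sums, counts = {}, {}
    for word in confident_words:
        for label, glyph in zip(word["text"], word["glyphs"]):
            if label not in sums:
                sums[label] = [0.0] * len(glyph)
                counts[label] = 0
            sums[label] = [s + g for s, g in zip(sums[label], glyph)]
            counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify_glyph(glyph, models):
    """Return the label whose template is nearest in squared Euclidean distance."""
    def distance(template):
        return sum((a - b) ** 2 for a, b in zip(glyph, template))
    return min(models, key=lambda label: distance(models[label]))


def reread_word(word, models):
    """Re-recognize every glyph of a word with the document-specific models."""
    return "".join(classify_glyph(g, models) for g in word["glyphs"])
```

In use, the low-confidence words from the initial pass would be passed through `reread_word` with the templates returned by `build_char_models(select_confident_words(ocr_words))`. The quality of the result depends on how few errors survive the selection step, which is why the abstract stresses nearly error-free training data.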

Andrew Kae, Gary Huang, Erik Learned-Miller, Carl Doersch

Character Models | Computer Vision | CVPR 2010 | Document-specific Character Models | Font Models

Post Info

Added	01 Apr 2010
Updated	14 May 2010
Type	Conference
Year	2010
Where	CVPR
Authors	Andrew Kae, Gary Huang, Erik Learned-Miller, Carl Doersch
