Bounding the Probability of Error for High Precision Optical Character Recognition


We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data...
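To make the precision/recall trade-off in the abstract concrete, here is a minimal sketch of the baseline it mentions: selecting a "clean set" by thresholding OCR confidence values. This is not the authors' bounding technique, which the abstract does not detail. The word list, confidence values, correctness labels, and the function names `select_clean_set` and `precision_recall` are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical OCR output: (recognized word, confidence in [0, 1], is_correct).
# All values are invented for illustration; they are not from the paper.
Word = Tuple[str, float, bool]


def select_clean_set(words: List[Word], threshold: float) -> List[Word]:
    """Keep only the words whose confidence is at least `threshold`."""
    return [w for w in words if w[1] >= threshold]


def precision_recall(words: List[Word], clean: List[Word]) -> Tuple[float, float]:
    """Precision: fraction of selected words that are correct.
    Recall: fraction of all correct words that were selected."""
    correct_selected = sum(1 for w in clean if w[2])
    total_correct = sum(1 for w in words if w[2])
    precision = correct_selected / len(clean) if clean else 1.0
    recall = correct_selected / total_correct if total_correct else 0.0
    return precision, recall


words = [
    ("the", 0.99, True),
    ("quick", 0.97, True),
    ("brovvn", 0.96, False),  # high confidence, yet wrong
    ("fox", 0.90, True),
    ("jurnps", 0.60, False),
    ("over", 0.55, True),
]

clean = select_clean_set(words, threshold=0.95)
precision, recall = precision_recall(words, clean)
print(f"selected={len(clean)} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data, a confidently misread word passes even a strict threshold, which is the kind of error the abstract says confidence thresholding leaves behind.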



JMLR 2012 | OCR System | Optical Character Recognition | Programming Languages | Translation 1 |


Added	27 Sep 2012
Updated	27 Sep 2012
Type	Journal
Year	2012
Where	JMLR
Authors	Gary B. Huang, Andrew Kae, Carl Doersch, Erik G. Learned-Miller
