Optical character recognition errors and their effects on natural language processing

Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.

Key words: Performance evaluation
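
The abstract describes error classification as an optimization problem solved with dynamic programming. As a rough illustration of that general idea only, the sketch below aligns a ground-truth string with OCR output using ordinary character-level edit distance and labels each difference as a substitution, deletion, or insertion. This is a flat, single-level alignment with unit costs, not the hierarchical method the paper proposes, and the function names `align` and `error_counts` are hypothetical.

```python
from typing import Dict, List, Tuple


def align(truth: str, ocr: str) -> List[Tuple[str, str, str]]:
    """Return (kind, truth_char, ocr_char) operations of one minimum-cost alignment.

    Assumption: every substitution, deletion, and insertion costs 1.
    """
    n, m = len(truth), len(ocr)
    # cost[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right cell to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (truth[i - 1] != ocr[j - 1])):
            kind = "match" if truth[i - 1] == ocr[j - 1] else "substitution"
            ops.append((kind, truth[i - 1], ocr[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # Character present in the ground truth but missing from the OCR output.
            ops.append(("deletion", truth[i - 1], ""))
            i -= 1
        else:
            # Extra character introduced by the OCR output.
            ops.append(("insertion", "", ocr[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def error_counts(truth: str, ocr: str) -> Dict[str, int]:
    """Count each error type in the alignment, ignoring matches."""
    counts = {"substitution": 0, "deletion": 0, "insertion": 0}
    for kind, _, _ in align(truth, ocr):
        if kind != "match":
            counts[kind] += 1
    return counts
```

For example, `error_counts("modern", "rnodern")` counts the familiar "m" read as "rn" confusion as one substitution plus one insertion. A classifier aimed at OCR would more plausibly treat such a case as a single multi-character error, which is one reason a flat alignment like this is only a starting point.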

Daniel P. Lopresti

Information Technology | Optical Character Recognition | Recognition Errors | SIGIR 2008 | Text Analysis Pipeline |

Post Info

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Journal
Year	2008
Where	SIGIR
Authors	Daniel P. Lopresti
