Automatic Assessment of OCR Quality in Historical Documents

Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border and marginal noise, bleed-through, skewing, and warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. This paper presents an iterative classification algorithm that automatically labels BBs as text or noise based on their spatial distribution and geometry. The approach uses a rule-based classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating information local to each BB: its spatial location, shape, and size. When evaluated on a dataset containing over 72,000 manually labeled BBs from 159 historical documents, the algorithm classifies BBs with 0.95 precision and 0.96 recall...
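
The two-stage idea in the abstract (rule-based initial labels, then iterative refinement from local context) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Box` type, the height and aspect-ratio thresholds in `initial_label`, the neighbourhood `radius`, and the majority-vote update in `refine` are all hypothetical stand-ins. The `precision_recall` helper only shows how the reported metrics are defined.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Hypothetical bounding box: top-left corner, width, height (pixels)."""
    x: float
    y: float
    w: float
    h: float


def initial_label(b: Box) -> bool:
    """Rule-based pass. True means 'text'. Thresholds are illustrative guesses."""
    aspect = b.w / b.h if b.h > 0 else 0.0
    return 8 <= b.h <= 60 and 0.2 <= aspect <= 15


def center(b: Box) -> Tuple[float, float]:
    return (b.x + b.w / 2, b.y + b.h / 2)


def neighbors(boxes: List[Box], i: int, radius: float) -> List[int]:
    """Indices of boxes whose centers lie within a square window around box i."""
    cx, cy = center(boxes[i])
    found = []
    for j, other in enumerate(boxes):
        if j == i:
            continue
        ox, oy = center(other)
        if abs(ox - cx) <= radius and abs(oy - cy) <= radius:
            found.append(j)
    return found


def refine(boxes: List[Box], labels: List[bool],
           radius: float = 150.0, max_iters: int = 10) -> List[bool]:
    """Iteratively relabel each box by a strict majority vote of its neighbors.

    This simple vote is a stand-in for the paper's iterative classifier.
    Boxes with no neighbors, or with a tied vote, keep their current label.
    """
    labels = list(labels)
    for _ in range(max_iters):
        updated = []
        for i in range(len(boxes)):
            nb = neighbors(boxes, i, radius)
            text_votes = sum(labels[j] for j in nb)
            if nb and 2 * text_votes > len(nb):
                updated.append(True)
            elif nb and 2 * text_votes < len(nb):
                updated.append(False)
            else:
                updated.append(labels[i])
        if updated == labels:  # converged, no label changed
            break
        labels = updated
    return labels


def precision_recall(pred: List[bool], truth: List[bool]) -> Tuple[float, float]:
    """Precision and recall for the 'text' class."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# A row of five word-sized boxes, one small mark inside the row,
# and one thin isolated streak far from any text.
row = [Box(x, 100, 40, 20) for x in (0, 50, 100, 150, 200)]
boxes = row + [Box(250, 107, 6, 6), Box(1000, 1000, 200, 3)]
first_pass = [initial_label(b) for b in boxes]
final = refine(boxes, first_pass)
```

In this toy layout the small mark inside the text row fails the size rule in the first pass but is relabeled as text once its neighbors are considered, while the isolated streak has no neighbors and stays labeled as noise.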

Anshul Gupta, Ricardo Gutierrez-Osuna, Matthew Christy, Boris Capitanu, Loretta Auvil, Liz Grumbach, Richard Furuta, Laura Mandell

AAAI 2015 | Intelligent Agents

» Efficient automatic OCR word validation using word partial format derivation and language ...

» Word-Based Adaptive OCR for Historical Books

» Character Enhancement for Historical Newspapers Printed Using Hot Metal Typesetting

» Text-image alignment for historical handwritten documents

» Evaluation of Spoken Document Retrieval for Historic Speech Collections

» Evaluating SEE: A Benchmarking System for Document Page Segmentation

» HistoSketch: A Semi-Automatic Annotation Tool for Archival Documents

» Automatic Filter Selection Using Image Quality Assessment

» Vectorization of Glyphs and Their Representation in SVG for XML based Processing

Post Info

Added	27 Mar 2016
Updated	27 Mar 2016
Type	Journal
Year	2015
Where	AAAI
Authors	Anshul Gupta, Ricardo Gutierrez-Osuna, Matthew Christy, Boris Capitanu, Loretta Auvil, Liz Grumbach, Richard Furuta, Laura Mandell