Sciweavers

ICDAR
2007
IEEE

Content-level Annotation of Large Collection of Printed Document Images

A large annotated corpus is critical to the development of robust optical character recognizers (OCRs). However, creating annotated corpora is tedious and laborious, especially when the annotation is at the character level. In this paper, we propose an efficient hierarchical approach for annotating large collections of printed document images. We align document images with independently keyed-in text. The method is model-driven and is intended to annotate a large collection of documents, scanned at three different resolutions, at the character level. We employ an XML representation to store the annotation information. APIs are provided for content-level access, for easy use in training and evaluating OCRs and in other document understanding tasks.
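To make the idea of XML-stored, hierarchically organized annotation with content-level access more concrete, here is a minimal sketch in Python. The abstract does not give the schema or the API, so everything in it is an assumption for illustration only: the element names (page, line, word, char), the attributes (image, dpi, text, bbox), and the helper functions are hypothetical, not the ones used by the authors.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation layout. The element and attribute names are
# illustrative assumptions, not the schema described in the paper.
SAMPLE_XML = """
<page image="page_001.png" dpi="300">
  <line id="l1" bbox="10 20 200 40">
    <word id="w1" text="cat" bbox="10 20 60 40">
      <char text="c" bbox="10 20 25 40"/>
      <char text="a" bbox="26 20 42 40"/>
      <char text="t" bbox="43 20 60 40"/>
    </word>
  </line>
</page>
"""


def parse_bbox(value):
    """Convert an 'x1 y1 x2 y2' string into a tuple of four ints."""
    return tuple(int(v) for v in value.split())


def iter_chars(root):
    """Yield (character, bounding box) pairs in document order."""
    for char in root.iter("char"):
        yield char.get("text"), parse_bbox(char.get("bbox"))


def words_with_boxes(root):
    """Return a list of (word text, bounding box) pairs."""
    return [(w.get("text"), parse_bbox(w.get("bbox"))) for w in root.iter("word")]


root = ET.fromstring(SAMPLE_XML.strip())
chars = list(iter_chars(root))      # character-level ground truth
words = words_with_boxes(root)      # word-level ground truth
```

Under these assumptions, an OCR training script could crop each character's bounding box from the page image and pair it with its label, while an evaluation script could work at the word or line level using the same file.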
Anand Kumar 0002, C. V. Jawahar
Added 03 Jun 2010
Updated 03 Jun 2010
Type Conference
Year 2007
Where ICDAR
Authors Anand Kumar 0002, C. V. Jawahar