Document: a useful level for facing noisy data

In this paper we present a set of experiments on large digitized collections of books, showing that logical structures can be extracted with good quality when working at the document level. The proposed solution relies on a two-step method. First, specific logical elements are recognized by a given method. Then, models for the recognized elements are generated by combining layout, content, and labeling information. These inferred models are used to correct noisy data, typically the zoning, OCR, and labeling errors produced by previous processing steps. The method is illustrated with the extraction of page numbers and chapter headings, two navigational elements required by digital libraries.
Categories and Subject Descriptors: I.7.5 [Document and Text Processing]: Document Capture - Optical character recognition (OCR), Document analysis
General Terms: Experimentation
Keywords: Logical Analysis, error correction, model
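As a rough illustration of the idea summarized in the abstract (recognize elements, infer a document-level model from them, then use that model to correct noise), the following sketch repairs noisy page numbers. It is not the authors' implementation. The function names, the input format, and the assumption that a single constant offset between printed page number and page position describes the whole document are all hypothetical simplifications chosen for the example.

```python
from collections import Counter
from typing import List, Optional


def parse_number(token: Optional[str]) -> Optional[int]:
    """Return the OCR token as an int, or None if it is missing or not numeric."""
    if token is not None and token.isdecimal():
        return int(token)
    return None


def correct_page_numbers(ocr_tokens: List[Optional[str]]) -> List[int]:
    """Repair per-page OCR tokens using one document-wide offset model.

    Assumption (hypothetical): printed number = page index + constant offset.
    """
    # Step 1, recognition: every readable token votes for an offset.
    votes = Counter()
    for index, token in enumerate(ocr_tokens):
        value = parse_number(token)
        if value is not None:
            votes[value - index] += 1
    if not votes:
        raise ValueError("no readable page number to build a model from")

    # Step 2, model: keep the offset supported by the most pages.
    offset, _ = votes.most_common(1)[0]

    # Step 3, correction: regenerate every page number from the model,
    # which overwrites misread tokens and fills in missing ones.
    return [index + offset for index in range(len(ocr_tokens))]


# Hypothetical OCR output for eight consecutive pages:
# "B" and "G" are misread digits, None is a page with no detected number,
# and the final "3" is a wrongly labeled token.
noisy = ["1", "2", "B", None, "5", "G", "7", "3"]
corrected = correct_page_numbers(noisy)
```

The point of the sketch is the design choice the abstract argues for: a single page cannot tell whether "B" is a misread "3", but evidence pooled across the whole document can. A real collection would need a richer model (several numbering sequences, roman numerals, layout position), which this example deliberately leaves out.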
Hervé Déjean, Jean-Luc Meunier
Added 10 Feb 2011
Updated 10 Feb 2011
Type Journal
Year 2010
Where AND