Sciweavers

CICLING
2008
Springer

Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, both contemporary and historical. Text-Induced Corpus Clean-up, or ticcl (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and, for each focus word, gathers all typographical variants that lie within a predefined Levenshtein distance (henceforth: ld). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. ticcl has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is written in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to properly evaluate our system, but also to draw effective conclusions towards the adaptation of the adopted correction...
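The core idea in the abstract, collecting variants within a bounded ld of high-frequency focus words, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the thresholds `min_focus_freq` and `max_ld`, the brute-force pairwise comparison, and the "variant must be rarer than the focus word" filter are all assumptions made for illustration, and a pairwise scan like this would not scale to giga-sized corpora.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each high-frequency focus word to nearby, rarer word forms.

    Hypothetical helper: thresholds and the frequency filter are
    illustrative assumptions, not values taken from the paper.
    """
    freq = Counter(tokens)
    focus_words = [w for w, n in freq.items() if n >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        candidates = [
            w for w in freq
            if w != focus
            and freq[w] < freq[focus]                # assumed filter: variants are rarer
            and abs(len(w) - len(focus)) <= max_ld   # cheap length pre-check
            and levenshtein(w, focus) <= max_ld
        ]
        variants[focus] = sorted(candidates)
    return variants


# Toy corpus with made-up OCR confusions (e -> c) of a Dutch word.
tokens = ["regering"] * 5 + ["regcring", "rcgering", "koning"]
print(gather_variants(tokens))
```

On the toy corpus, only `regering` is frequent enough to act as a focus word, and the two OCR-style misreadings fall within the distance limit while `koning` does not.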
Martin Reynaert
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where CICLING