Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up or ticcl (pronounced 'tickle') focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within the predefined Levenshtein distance (henceforth: ld). Simple text-induced filtering techniques help to retain as many as possible of the true positives and to discard as many as possible of the false positives. ticcl has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR-quality is far lower and which is in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to properly evaluate our system, but also to draw effective conclusions towards the adaptation of the adopted correction...
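
The core idea in the abstract, gathering variants of frequent focus words within a bounded Levenshtein distance, can be illustrated with a small Python sketch. This is not the ticcl implementation: it is a brute-force illustration, and the names `levenshtein`, `gather_variants`, `min_focus_freq` and `max_ld`, the threshold values, and the frequency-based filter are all assumptions made for this example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each frequent 'focus word' to rarer word forms within max_ld edits.

    Hypothetical simplification: a focus word is any form seen at least
    min_focus_freq times, and a candidate variant must be strictly less
    frequent than its focus word. Both thresholds are illustrative only.
    """
    counts = Counter(tokens)
    focus_words = [w for w, c in counts.items() if c >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        variants[focus] = sorted(
            w for w, c in counts.items()
            if w != focus
            and c < counts[focus]
            and levenshtein(focus, w) <= max_ld
        )
    return variants


# Toy corpus: one frequent Dutch word plus two OCR-style misreadings.
tokens = ["regering"] * 5 + ["regcring", "rcgering", "minister"]
result = gather_variants(tokens)
```

On this toy input, `result` maps `"regering"` to the two misread forms and leaves `"minister"` out. The pairwise comparison is quadratic in the vocabulary size, so it would not scale to the giga-scale collections the paper targets. A bare distance threshold would also accept real words that happen to be close to a focus word, which is the kind of false positive the abstract's filtering step is meant to discard.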

Martin Reynaert

CICLING 2008 | Natural Language Processing | OCR-induced Typographical Variation | Typographical | Typographical Variants

Post Info

Added	12 Oct 2010
Updated	12 Oct 2010
Type	Conference
Year	2008
Where	CICLING
Authors	Martin Reynaert
