Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsupervised topic modeling. We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. To our knowledge, this study is the first of its kind.
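The following is a minimal sketch of two ideas mentioned in the abstract: injecting synthetic character-level noise into text, and filtering low-frequency words before topic modeling. It is not the authors' implementation. The function names (`corrupt`, `token_mismatch_rate`, `filter_low_frequency`), the substitution-only noise model, and the `min_count` threshold are assumptions made for illustration. The mismatch rate is a simplified stand-in for word error rate, valid only because substitution-only noise keeps tokens aligned.

```python
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt(text, char_error_rate, rng):
    """Replace each letter with a random lowercase letter with the given probability.

    Hypothetical noise model: substitutions only, so whitespace and token
    boundaries are preserved.
    """
    return "".join(
        rng.choice(ALPHABET) if c.isalpha() and rng.random() < char_error_rate else c
        for c in text
    )


def token_mismatch_rate(clean_tokens, noisy_tokens):
    """Fraction of aligned token positions that differ.

    A simplified stand-in for word error rate; it assumes both sequences
    have the same length.
    """
    if len(clean_tokens) != len(noisy_tokens):
        raise ValueError("token sequences must be aligned")
    if not clean_tokens:
        return 0.0
    wrong = sum(1 for a, b in zip(clean_tokens, noisy_tokens) if a != b)
    return wrong / len(clean_tokens)


def filter_low_frequency(docs, min_count):
    """Drop tokens whose corpus-wide count is below min_count."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in docs]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = "topic models discover topical semantics in large text collections"
    noisy = corrupt(clean, 0.1, rng)
    print(noisy)
    print(token_mismatch_rate(clean.split(), noisy.split()))
    print(filter_low_frequency([clean.split(), noisy.split()], min_count=2))
```

Because OCR errors tend to produce many distinct rare tokens, a frequency cutoff removes much of the noise vocabulary, though the abstract reports that LDA quality still degrades as error rates rise.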

Daniel David Walker, William B. Lund, Eric K. Ringger

EMNLP 2010 | Latent Dirichlet Allocation | Natural Language Processing | Topic Analysis | Unprocessed OCR Output

Post Info

Added	11 Feb 2011
Updated	11 Feb 2011
Type	Conference
Year	2010
Where	EMNLP
Authors	Daniel David Walker, William B. Lund, Eric K. Ringger
