Information retrieval for OCR documents: a content-based probabilistic correction model

The difficulty with information retrieval for OCR documents is that they contain a significant number of erroneous words, while most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost” retrieval performance. The basic idea of this correction model is to exploit the whole content of a document to supplement the information that an existing OCR correction tool provides for word corrections. Instead of making an explicit correction decision for each erroneous word, as a traditional approach typically does, we consider the uncertainties in such correction decisions and compute an estimate of the original “uncorrupted” document language model accordingly. The document language model can then be used for retrieval with a language modeling retrieval approach. E...
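The idea of keeping correction uncertainty rather than committing to one correction per word can be illustrated with a minimal sketch. It is not the paper's model: it assumes that some correction tool already supplies candidate words with probabilities for each OCR token, and it uses a plain unigram model with linear interpolation smoothing for query-likelihood scoring. The function names, the `candidates` format, and the `lam` parameter are all hypothetical.

```python
import math
from collections import defaultdict

def expected_language_model(ocr_tokens, candidates):
    """Unigram model from soft counts: each OCR token spreads one unit of
    count over its candidate corrections instead of picking a single one."""
    counts = defaultdict(float)
    for token in ocr_tokens:
        # Assumption: a token with no listed candidates is kept unchanged.
        dist = candidates.get(token, {token: 1.0})
        total = sum(dist.values())
        for word, p in dist.items():
            counts[word] += p / total
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

def query_log_likelihood(query, doc_model, background, lam=0.5):
    """Query-likelihood score; linear interpolation with a background
    model is an assumed smoothing choice, not taken from the paper."""
    score = 0.0
    for w in query:
        p = lam * doc_model.get(w, 0.0) + (1 - lam) * background.get(w, 1e-9)
        score += math.log(p)
    return score

# Hypothetical input: one garbled token with two candidate corrections.
ocr_tokens = ["retrieva1", "model"]
candidates = {"retrieva1": {"retrieval": 0.8, "retrieved": 0.2}}
soft_model = expected_language_model(ocr_tokens, candidates)

# A hard decision would keep only the top candidate for each token.
hard = {t: {max(d, key=d.get): 1.0} for t, d in candidates.items()}
hard_model = expected_language_model(ocr_tokens, hard)
```

In this sketch the soft model still assigns some probability to "retrieved", so a query containing that word is penalized less than under the hard-decision model, which gives it no document probability at all.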

Rong Jin, ChengXiang Zhai, Alexander G. Hauptmann

Content-based Correction Model | Correction Model | Document Analysis | DRR 2003 | OCR Correction Tool

Added	31 Oct 2010
Updated	31 Oct 2010
Type	Conference
Year	2003
Where	DRR
Authors	Rong Jin, ChengXiang Zhai, Alexander G. Hauptmann
