An Open Source Tesseract Based Optical Character Recognizer for Bangla Script

BanglaOCR, developed by the Center for Research on Bangla Language Processing (CRBLP), is currently the only open source optical character recognition (OCR) software for the Bangla (Bengali) script. Tesseract, maintained by Google, is considered one of the most accurate free, open source OCR engines currently available. In this paper, we present a new OCR for the Bangla script that combines the recognition power of Tesseract with the Bangla script processing power of BanglaOCR by integrating the Tesseract recognition engine into BanglaOCR. We first present the complete methodology for building the combined OCR, followed by the implementation strategy. We focus on the training data preparation process, the Tesseract integration procedure, and the post-processing techniques. The techniques described here can be readily applied to build OCRs for other scripts as well.
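As a rough illustration of the recognize-then-post-process flow described in the abstract, the following minimal sketch calls the Tesseract command-line tool and cleans up its output. It is not the authors' implementation. It assumes a present-day Tesseract installation with the Bengali language data (`ben`) available, and the function names `build_tesseract_command`, `postprocess`, and `recognize` are hypothetical. Unicode NFC normalization and whitespace cleanup stand in for the paper's post-processing techniques, which the abstract does not detail.

```python
import shutil
import subprocess
import unicodedata


def build_tesseract_command(image_path, lang="ben"):
    """Build the CLI call; "stdout" as the output base prints the text."""
    # Assumption: the Bengali traineddata is installed under the code "ben".
    return ["tesseract", image_path, "stdout", "-l", lang]


def postprocess(text):
    """Placeholder post-processing (an assumption, not the paper's method).

    Applies Unicode NFC normalization, collapses runs of whitespace
    within each line, and drops empty lines.
    """
    normalized = unicodedata.normalize("NFC", text)
    lines = [" ".join(line.split()) for line in normalized.splitlines()]
    return "\n".join(line for line in lines if line)


def recognize(image_path, lang="ben"):
    """Run Tesseract on an image file and return the cleaned-up text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return postprocess(result.stdout)
```

With Tesseract installed, `recognize("page.png")` would return the recognized text for a hypothetical image file `page.png`. Keeping `postprocess` separate from the engine call leaves room to swap in script-specific corrections later without touching the recognition step.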

Document Analysis | ICDAR 2009 | Open Source | Tesseract | Tesseract Recognition Engine |

Added	18 Feb 2011
Updated	18 Feb 2011
Type	Journal
Year	2009
Where	ICDAR
Authors	Md. Abul Hasnat, Muttakinur Rahman Chowdhury, Mumit Khan
