A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques


Download www.cis.uni-muenchen.de

We describe a new corpus collected for the comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available to academic groups for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents, many of which contain both Cyrillic and Latin symbols. A smaller corpus of German documents has been added. All original documents are real-life paper documents collected from enterprises and organizations, covering most genres of written language and various document types. The corpus contains the corresponding image files, rich metadata, text files obtained via OCR, ground truth data for hundreds of example pages, and alignment software for experiments.
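To illustrate how ground truth pages and OCR output of this kind are typically compared, here is a minimal sketch of a character error rate computed from the Levenshtein edit distance. It is not the alignment software shipped with the corpus; the function names `edit_distance` and `character_error_rate` and the sample strings are assumptions made for this example only.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Levenshtein distance with unit costs (hypothetical helper)."""
    # prev[j] holds the distance between the processed prefix of
    # `truth` and the first j characters of `ocr`.
    prev = list(range(len(ocr) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, o in enumerate(ocr, start=1):
            cost = 0 if t == o else 1
            curr.append(min(
                prev[j] + 1,         # truth character missing in OCR output
                curr[j - 1] + 1,     # spurious character in OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance normalized by the ground truth length."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


# Assumed sample: a Cyrillic word whose first two letters were
# recognized as the visually similar Latin letters "C" and "o".
truth = "\u0421\u043e\u0444\u0438\u044f"  # Cyrillic throughout
ocr = "Co\u0444\u0438\u044f"              # Latin "Co" + Cyrillic rest

print(edit_distance(truth, ocr))
print(character_error_rate(truth, ocr))
```

The sample pair differs only in code points, not in appearance, which is the kind of confusion that mixed Cyrillic and Latin documents can produce and that a purely visual check would miss.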

Stoyan Mihov, Klaus U. Schulz, Christoph Ringlstetter, Veselka Dojchinova, Vanja Nakova


Document | Document Analysis | ICDAR 2005 | Real-life Paper Documents | Smaller Corpus



Post Info

Added	24 Jun 2010
Updated	24 Jun 2010
Type	Conference
Year	2005
Where	ICDAR
Authors	Stoyan Mihov, Klaus U. Schulz, Christoph Ringlstetter, Veselka Dojchinova, Vanja Nakova


Sciweavers
