Sciweavers

ICDAR
2009
IEEE

Automated Ground Truth Data Generation for Newspaper Document Images

In document image understanding, public datasets with ground truth are an important part of scientific work. They are not only helpful for developing new methods but also provide a way of comparing performance. Generating these datasets, however, is time-consuming and cost-intensive, requiring a lot of manual effort. In this paper we both propose a way to semi-automatically generate ground-truthed datasets for newspapers and provide a comprehensive dataset. The focus of this paper is layout analysis ground truth. The proposed two-step approach consists of a module that automatically creates layouts and an image matching module that maps the ground truth information from the synthetic layout to the scanned version. In the first step, layouts are generated automatically from a news corpus. The output consists of a digital newspaper (PDF file) and an XML file containing geometric and logical layout information. In the second step, the PDF files are printed, scanned a...
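To make the mapping step in the abstract concrete, the following minimal sketch shows how ground-truth regions defined on the synthetic layout could be carried over to a scanned page once a geometric transform between the two is known. The abstract does not say which transform or matching method the authors use, so the affine model, the `map_box` helper, and the example numbers are all assumptions made for illustration only.

```python
from typing import Tuple

# (x_min, y_min, x_max, y_max) in page coordinates
Box = Tuple[float, float, float, float]
# Affine parameters (a, b, tx, c, d, ty) for
#   x' = a*x + b*y + tx
#   y' = c*x + d*y + ty
Affine = Tuple[float, float, float, float, float, float]


def apply_affine(t: Affine, x: float, y: float) -> Tuple[float, float]:
    """Map one point from layout coordinates to scan coordinates."""
    a, b, tx, c, d, ty = t
    return (a * x + b * y + tx, c * x + d * y + ty)


def map_box(t: Affine, box: Box) -> Box:
    """Map a ground-truth box and return its axis-aligned bounds in the scan.

    All four corners are transformed so the result stays correct
    when the transform includes a small rotation or skew.
    """
    x0, y0, x1, y1 = box
    corners = [
        apply_affine(t, x, y)
        for x, y in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))
    ]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical transform: the scan is twice the layout resolution
# and shifted by (10, 5) pixels. In practice this would come from
# an image matching step, which is not modeled here.
scan_from_layout: Affine = (2.0, 0.0, 10.0, 0.0, 2.0, 5.0)

# Hypothetical headline region taken from the layout XML.
headline: Box = (100.0, 50.0, 400.0, 90.0)

mapped = map_box(scan_from_layout, headline)
print(mapped)
```

Transforming all four corners rather than only two keeps the mapped box valid under rotation, at the cost of returning a slightly larger axis-aligned region when the scan is skewed.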
Added 21 May 2010
Updated 21 May 2010
Type Conference
Year 2009
Where ICDAR
Authors Thomas Strecker, Joost van Beusekom, Sahin Albayrak, Thomas M. Breuel