Page segmentation into text and non-text components is an essential preprocessing step before OCR. If it is not done properly, the OCR classification engine produces garbage text because of the non-text components. This paper describes improvements to the text/image segmentation algorithm described by Bloomberg [1], which is also available in his open-source Leptonica library [2]. The modifications yield significant improvements over Bloomberg’s algorithm on the UW-III, UNLV, and ICDAR 2009 page segmentation competition test images and on circuit diagram datasets.
Syed Saqib Bukhari, Faisal Shafait, Thomas M. Breu