ICDAR
2009
IEEE

Hybrid Page Layout Analysis via Tab-Stop Detection

A new hybrid page layout analysis algorithm is proposed. It uses bottom-up methods to form an initial data-type hypothesis and to locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open-source OCR engine at http://code.google.com/p/tesseract-ocr.
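To make the idea of locating tab-stops bottom-up more concrete, here is a minimal Python sketch. It is not the paper's algorithm or the Tesseract implementation, and the abstract does not specify those details. It only assumes that text fragments are available as bounding boxes and treats any x position where several left edges line up as a candidate left tab-stop. The function name `find_left_tab_stops`, the parameters `tolerance` and `min_support`, and the sample boxes are all hypothetical.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; a hypothetical box format.
Box = Tuple[int, int, int, int]


def find_left_tab_stops(
    boxes: List[Box], tolerance: int = 3, min_support: int = 3
) -> List[int]:
    """Return x positions shared by at least `min_support` left edges.

    Left edges are sorted and grouped greedily: an edge joins the current
    group if it lies within `tolerance` pixels of the group's first edge.
    Each sufficiently large group yields its median edge as a tab-stop.
    """
    lefts = sorted(box[0] for box in boxes)
    tab_stops: List[int] = []
    group: List[int] = []
    for x in lefts:
        # Close the current group when this edge is too far from its start.
        if group and x - group[0] > tolerance:
            if len(group) >= min_support:
                tab_stops.append(group[len(group) // 2])
            group = []
        group.append(x)
    # Handle the final group after the loop ends.
    if len(group) >= min_support:
        tab_stops.append(group[len(group) // 2])
    return tab_stops


# Toy two-column page: three lines per column plus one stray fragment.
example_boxes: List[Box] = [
    (50, 100, 300, 120),
    (51, 130, 298, 150),
    (49, 160, 300, 180),
    (340, 100, 590, 120),
    (341, 130, 588, 150),
    (340, 160, 590, 180),
    (120, 400, 200, 420),  # isolated fragment, too little support
]

print(find_left_tab_stops(example_boxes))  # [50, 340]
```

In this toy input the two aligned groups of left edges give candidate tab-stops near x = 50 and x = 340, while the isolated fragment at x = 120 is ignored. Deducing columns and reading order from such candidates, as the abstract describes, would need further steps that this sketch does not attempt.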
Raymond W. Smith
Added 21 May 2010
Updated 21 May 2010
Type Conference
Year 2009
Where ICDAR
Authors Raymond W. Smith