This paper presents a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words, using only the geometric information of the input glyphs. The algorithm is probability based. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text word extraction algorithm, we used a 3-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. An area-overlap measure was used to find the correspondence between the detected entities and the ground truth. For a total of 827,433 ground-truth words, the algorithm identified and segmented 806,149 words correctly, an accuracy of 97.43%.
Yalin Wang, Robert M. Haralick, Ihsin T. Phillips
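The area-overlap correspondence mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact matching criterion, so everything below is an assumption made for illustration: boxes are `(x_min, y_min, x_max, y_max)` tuples, a detected word counts as correct when its intersection with a ground-truth word covers at least a hypothetical `threshold` fraction of both boxes, and the names `overlap_area` and `count_correct` are not from the paper.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def area(b: Box) -> float:
    """Area of a box; degenerate boxes have zero area."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def count_correct(truth: List[Box], detected: List[Box],
                  threshold: float = 0.9) -> int:
    """Count ground-truth words with a one-to-one match among detections.

    The threshold value is an illustrative assumption, not the paper's.
    """
    used = set()  # indices of detections already matched
    correct = 0
    for t in truth:
        for j, d in enumerate(detected):
            if j in used or area(t) == 0.0 or area(d) == 0.0:
                continue
            inter = overlap_area(t, d)
            # Require the overlap to cover most of BOTH boxes, so that
            # split or merged words are not counted as correct.
            if inter / area(t) >= threshold and inter / area(d) >= threshold:
                used.add(j)
                correct += 1
                break
    return correct


# Toy data: the second ground-truth word is split into two detections.
truth = [(0, 0, 10, 5), (12, 0, 20, 5)]
detected = [(0, 0, 10, 5), (12, 0, 16, 5), (16, 0, 20, 5)]
toy_accuracy = count_correct(truth, detected) / len(truth)

# The accuracy reported in the abstract, recomputed from its own counts.
reported_accuracy = round(806_149 / 827_433 * 100, 2)
```

In the toy data the split word is not credited, because neither fragment covers enough of the ground-truth box. The last line simply checks that the two word counts quoted in the abstract are consistent with the stated percentage.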