ICPR
2000
IEEE

Statistical-Based Approach to Word Segmentation

This paper presents a text word extraction algorithm that takes a set of bounding boxes of glyphs and their associated text lines of a given document and partitions the glyphs into a set of text words, using only the geometric information of the input glyphs. The algorithm is probability based. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. To evaluate the performance of our text word extraction algorithm, we used a 3-fold validation method and developed a quantitative performance measure. The algorithm was evaluated on the UW-III database of some 1600 scanned document image pages. An area-overlap measure was used to find the correspondence between the detected entities and the ground truth. For a total of 827,433 ground-truth words, the algorithm identified and segmented 806,149 words correctly, an accuracy of 97.43%.
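As a rough illustration of the evaluation step described in the abstract, the sketch below matches detected word boxes to ground-truth boxes by area overlap and then computes the accuracy figure. It is not the authors' implementation: the `Box` type, the helper names, the greedy one-to-one matching, and the 0.9 overlap threshold are all assumptions made for this example, since the abstract does not specify them.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(b: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(0.0, b.x1 - b.x0) * max(0.0, b.y1 - b.y0)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def count_correct(detected: List[Box], truth: List[Box],
                  threshold: float = 0.9) -> int:
    """Count ground-truth boxes that have a one-to-one detected match.

    Assumption: a match requires the intersection to cover at least
    `threshold` of BOTH boxes. The paper's actual criterion may differ.
    """
    used = set()  # indices of detected boxes already matched
    correct = 0
    for g in truth:
        for i, d in enumerate(detected):
            if i in used or area(g) == 0 or area(d) == 0:
                continue
            inter = overlap_area(g, d)
            if inter / area(g) >= threshold and inter / area(d) >= threshold:
                used.add(i)
                correct += 1
                break
    return correct


def accuracy_percent(correct: int, total: int) -> float:
    """Percentage of ground-truth words segmented correctly."""
    return 100.0 * correct / total
```

With the counts reported in the abstract, `accuracy_percent(806149, 827433)` rounds to 97.43, which agrees with the stated accuracy.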
Yalin Wang, Robert M. Haralick, Ihsin T. Phillips
Added 31 Jul 2010
Updated 31 Jul 2010
Type Conference
Year 2000
Where ICPR
Authors Yalin Wang, Robert M. Haralick, Ihsin T. Phillips