There is an increasingly pressing need for document analysis methods that can cope with images of documents containing printed regions of complex shapes. Most past page segmentation and classification approaches assume rectangular regions and represent them by bounding boxes; a more flexible description is needed, one that also retains most of the functionality of the rectangle-based representation. In the first part of this paper, the practical considerations of describing and handling complex-shaped regions are examined and an appropriate representation scheme is proposed. For page classification, a new approach based on the description of white space inside regions is presented. In contrast to previous page classification approaches, skewed and complex-shaped regions are handled efficiently, and the features are derived without time-consuming access to the pixel-based image data.
Apostolos Antonacopoulos, R. T. Ritchings