
DOCENG
2010
ACM

Picture detection in document page images

We present a method for picture detection in document page images, which can come from scanned or camera images, or rendered from electronic file formats. Our method uses OCR to separate out the text and applies the Normalized Cuts algorithm to cluster the non-text pixels into picture regions. A refinement step uses the captions found in the OCR text to deduce how many pictures are in a picture region, thereby correcting for under- and over-segmentation. A performance evaluation scheme is applied which takes into account the detection quality and fragmentation quality. We benchmark our method against the ABBYY application on page images from conference papers.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing.
General Terms: Algorithms, Performance.
Keywords: Picture detection, page image, entity extraction, OCR, document image analysis.
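The caption-based refinement described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified, page-level check that counts caption-like lines in OCR output and compares that count with the number of detected picture regions. The caption pattern and the names `CAPTION_RE`, `caption_numbers`, and `segmentation_hint` are assumptions made for illustration only.

```python
import re

# Assumed caption pattern: a line starting with "Figure 3" or "Fig. 3".
# Real captions vary by venue, so this pattern is only illustrative.
CAPTION_RE = re.compile(r"^\s*(?:Figure|Fig\.?)\s+(\d+)", re.IGNORECASE)


def caption_numbers(ocr_lines):
    """Return the sorted, distinct figure numbers found in caption-like lines."""
    numbers = set()
    for line in ocr_lines:
        match = CAPTION_RE.match(line)
        if match:
            numbers.add(int(match.group(1)))
    return sorted(numbers)


def segmentation_hint(num_regions, ocr_lines):
    """Compare the detected region count with the caption count.

    Fewer regions than captions suggests merged pictures
    (under-segmentation); more regions than captions suggests
    split pictures (over-segmentation).
    """
    expected = len(caption_numbers(ocr_lines))
    if num_regions < expected:
        return "under-segmented"
    if num_regions > expected:
        return "over-segmented"
    return "consistent"
```

A caller would pass the number of picture regions produced by the clustering step together with the OCR text lines for the same page. The abstract describes the refinement as operating per picture region, so a fuller version would also need to associate each caption with a nearby region.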
Patrick Chiu, Francine Chen, Laurent Denoue
Added 08 Nov 2010
Updated 08 Nov 2010
Type Conference
Year 2010
Where DOCENG
Authors Patrick Chiu, Francine Chen, Laurent Denoue