Robust unsupervised segmentation of degraded document images with topic models

Segmentation of document images remains a challenging vision problem. Although document images have a structured layout, capturing enough of it for segmentation can be difficult. Most current methods combine text extraction and heuristics for segmentation, but text extraction is prone to failure and measuring accuracy remains a difficult challenge. Furthermore, when presented with significant degradation many common heuristic methods fall apart. In this paper, we propose a Bayesian generative model for document images which seeks to overcome some of these drawbacks. Our model automatically discovers different regions present in a document image in a completely unsupervised fashion. We attempt no text extraction, but rather use discrete patch-based codebook learning to make our probabilistic representation feasible. Each latent region topic is a distribution over these patch indices. We capture rough document layout with an MRF Potts model. We take an analysis-by-synthesis approach ...
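The two ingredients named in the abstract, region topics as distributions over patch codebook indices and a Potts prior on the layout, can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`potts_energy`, `data_energy`, `icm_sweep`), the smoothing weight `beta`, the 4-connected grid, and the two-topic example are all assumptions made for illustration, and iterated conditional modes (ICM) is used here only as a simple stand-in for whatever inference the paper actually performs.

```python
import math

def potts_energy(labels, beta):
    """Potts prior energy: beta times the number of 4-connected
    neighbor pairs whose region labels differ (assumed grid layout)."""
    rows, cols = len(labels), len(labels[0])
    disagreements = 0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows and labels[r][c] != labels[r + 1][c]:
                disagreements += 1
            if c + 1 < cols and labels[r][c] != labels[r][c + 1]:
                disagreements += 1
    return beta * disagreements

def data_energy(labels, codewords, topics):
    """Negative log-likelihood of each patch's codebook index under
    the topic (distribution over indices) assigned to that patch."""
    return -sum(
        math.log(topics[labels[r][c]][codewords[r][c]])
        for r in range(len(labels))
        for c in range(len(labels[0]))
    )

def icm_sweep(labels, codewords, topics, beta):
    """One in-place ICM sweep: each patch takes the label that minimizes
    its local data cost plus Potts disagreement with its neighbors.
    ICM is a swapped-in simple technique, not the paper's inference."""
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                labels[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols
            ]

            def local_cost(k):
                mismatch = sum(1 for n in neighbors if n != k)
                return -math.log(topics[k][codewords[r][c]]) + beta * mismatch

            labels[r][c] = min(range(len(topics)), key=local_cost)
    return labels

# Hypothetical toy data: two topics over a two-entry codebook,
# and a 3x3 grid of patch codewords with one outlier in the center.
topics = [[0.9, 0.1], [0.1, 0.9]]
codewords = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
labels = [row[:] for row in codewords]  # initialize label = codeword

before = data_energy(labels, codewords, topics) + potts_energy(labels, 2.0)
icm_sweep(labels, codewords, topics, beta=2.0)
after = data_energy(labels, codewords, topics) + potts_energy(labels, 2.0)
print(labels, before, after)
```

In this toy setting, a larger `beta` makes isolated labels more expensive, so a single noisy patch tends to be absorbed into the surrounding region; with `beta` set to zero each patch is labeled purely by its own codeword likelihood.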

Timothy J. Burns, Jason J. Corso

Bayesian Generative Model | Computer Vision | CVPR 2009 | Document Image | Text Extraction

Added	04 Sep 2010
Updated	04 Sep 2010
Type	Conference
Year	2009
Where	CVPR
Authors	Timothy J. Burns, Jason J. Corso
