This paper describes features and methods for document image comparison and classification at the spatial layout level. The methods are useful for visual-similarity-based document retrieval as well as for fast initial document type classification without OCR. A novel feature set called interval encoding is introduced to capture elements of spatial layout. This feature set encodes region layout information in fixed-length vectors, which can be used for fast page layout comparison. The paper describes experiments and results on rank-ordering a set of document pages in terms of their layout similarity to a test document. We also demonstrate the usefulness of the features derived from interval encoding in a hidden Markov model based page layout classification system that is trainable and extendible.
Jianying Hu, Ramanujan S. Kashi, Gordon T. Wilfong