Numerous approaches to detecting duplicate documents, including textual, structural, and feature-based methods, have been investigated. Because document images are usually stored and transmitted in compressed form, it is advantageous to perform document matching directly on the compressed data. We present an algorithm for matching CCITT Group 4 compressed document images using a feature set directly computable from the Group 4 compression scheme. Multiple descriptors based on the local arrangement of feature points are constructed for efficient indexing into the database. We describe the procedures for feature extraction and descriptor generation, and discuss the performance of the algorithm on the UW database.
Dar-Shyang Lee, Jonathan J. Hull
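
The idea of indexing a database with descriptors built from the local arrangement of feature points can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify which Group 4 features are used or how descriptors are formed, so the code assumes feature points are already available as plain (x, y) integer pairs. The function names (`local_descriptors`, `build_index`, `query`), the neighbour count `k`, and the quantization step `cell` are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from math import hypot


def local_descriptors(points, k=3, cell=8):
    """Return one descriptor per point (assumed scheme, not from the paper).

    Each descriptor is the sorted tuple of quantized offsets from a point
    to its k nearest neighbours.
    """
    descriptors = []
    for i, (x, y) in enumerate(points):
        others = [p for j, p in enumerate(points) if j != i]
        # Sort by distance; break ties by coordinates so the result is deterministic.
        nearest = sorted(others, key=lambda p: (hypot(p[0] - x, p[1] - y), p))[:k]
        # Offsets are relative to the anchor point, so translating the whole
        # page leaves the descriptor unchanged.
        key = tuple(sorted(((px - x) // cell, (py - y) // cell) for px, py in nearest))
        descriptors.append(key)
    return descriptors


def build_index(database):
    """Map each descriptor to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, points in database.items():
        for descriptor in local_descriptors(points):
            index[descriptor].add(doc_id)
    return index


def query(index, points):
    """Rank documents by how many query descriptors they share."""
    votes = defaultdict(int)
    for descriptor in local_descriptors(points):
        for doc_id in index.get(descriptor, ()):
            votes[doc_id] += 1
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because every point contributes its own descriptor, a query can still collect votes for the right document when some feature points are missing or perturbed, which is one plausible reason for using multiple local descriptors rather than a single global signature.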