
PRICAI
2000
Springer

Text Retrieval from Document Images based on N-Gram Algorithm

In this paper, we propose a method of text retrieval from document images using a similarity measure based on an N-Gram algorithm. We directly extract image features instead of using optical character recognition. Character image objects are first extracted from document images based on connected components, and an unsupervised classifier is then used to classify these objects. All objects are encoded according to one unified class set, and each document image is represented by one stream of object codes. Next, we retrieve N-Gram slices from these streams and build document vectors. Lastly, we obtain the pair-wise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. During the test, the similarity of document images obtained with this method was compared with the result for the ASCII versions of those documents based on the N-Gram algorithm for text documents.
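
The last steps of the abstract (N-Gram slices, document vectors, scalar product) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of n = 3, the unit-length scaling of the vectors, and the integer code streams are all assumptions made for illustration. The sketch also assumes that connected-component extraction and unsupervised classification have already turned each document image into a stream of object codes.

```python
from collections import Counter
from math import sqrt


def ngram_vector(codes, n=3):
    """Count the n-gram slices of an object-code stream and scale to unit length.

    The unit-length scaling is an assumption; the abstract only states that
    document vectors are built from N-Gram slices.
    """
    counts = Counter(tuple(codes[i:i + n]) for i in range(len(codes) - n + 1))
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # stream shorter than n: no slices, empty vector
    return {gram: c / norm for gram, c in counts.items()}


def similarity(vec_a, vec_b):
    """Scalar product of two sparse document vectors."""
    if len(vec_a) > len(vec_b):
        vec_a, vec_b = vec_b, vec_a  # iterate over the smaller vector
    return sum(w * vec_b.get(gram, 0.0) for gram, w in vec_a.items())


# Hypothetical object-code streams; each integer stands for one class of
# character image object assigned by the unsupervised classifier.
doc_a = [4, 7, 7, 2, 9, 4, 7, 7, 2]
doc_b = [4, 7, 7, 2, 9, 1, 3]
doc_c = [5, 5, 6, 8, 8, 6]

sim_ab = similarity(ngram_vector(doc_a), ngram_vector(doc_b))
sim_ac = similarity(ngram_vector(doc_a), ngram_vector(doc_c))
```

Under these assumptions, two streams that share many slices score close to 1, and streams with no slice in common score 0. Because the codes are arbitrary class labels rather than recognized characters, the same functions would apply unchanged to ASCII text by treating each character as a code.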
Chew Lim Tan, Sam Yuan Sung, Zhaohui Yu, Yi Xu
Added 25 Aug 2010
Updated 25 Aug 2010
Type Conference
Year 2000
Where PRICAI