Current approaches to script identification rely on hand-selected features and often require processing a significant part of the document to achieve reliable identification. We present an approach that applies a large pool of image features to a small training sample and uses feature subset selection techniques to automatically select the subset with the most discriminating power. At run time, we use a classifier coupled with an evidence accumulation engine to report a script label once a preset likelihood threshold has been reached. We apply the system to a diverse corpus of printed Russian and English documents that suffer from common degradation problems. Our validation study shows promising results in terms of both script identification accuracy and the ability to identify script at the scale of individual words and text lines.
Vitaly Ablavsky, Mark R. Stevens
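The run-time step described in the abstract (a classifier whose outputs feed an evidence accumulation engine that reports a label once a preset threshold is reached) can be illustrated with a minimal sketch. The abstract does not specify the accumulation rule, so everything below is an assumption made for illustration: the function name `accumulate_evidence`, the two-class setup, the treatment of per-component classifier outputs as independent posteriors, and the summing of log-odds until a symmetric bound is crossed. It is not the authors' implementation.

```python
import math


def accumulate_evidence(posteriors, threshold=0.95, prior=0.5):
    """Combine per-component estimates of P(Russian | component) sequentially.

    Hypothetical sketch: assumes components are independent, so their
    log-odds add. Stops as soon as the accumulated posterior for either
    script reaches `threshold`.

    Returns (label, n_used); label is None if the threshold is never reached.
    """
    # Start from the prior log-odds of "Russian" versus "English".
    log_odds = math.log(prior / (1.0 - prior))
    # Log-odds value that corresponds to the preset likelihood threshold.
    bound = math.log(threshold / (1.0 - threshold))
    eps = 1e-6  # keeps the logarithms finite for outputs of exactly 0 or 1
    n = 0
    for n, p in enumerate(posteriors, start=1):
        p = min(max(p, eps), 1.0 - eps)
        log_odds += math.log(p / (1.0 - p))
        if log_odds >= bound:
            return "Russian", n
        if log_odds <= -bound:
            return "English", n
    # Not enough evidence yet: the caller would keep reading components.
    return None, n


if __name__ == "__main__":
    # Three connected components, each mildly favoring Russian.
    print(accumulate_evidence([0.7, 0.8, 0.9]))
```

Under these assumptions, a few moderately confident component-level decisions are enough to cross the threshold, which is consistent with the abstract's goal of labeling script at the scale of individual words or text lines rather than whole documents.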