The seeping of ink through the pages of certain double-sided handwritten documents after long periods of storage poses a serious problem for human readers and OCR systems alike. This pape...
This paper presents a language identification technique that differentiates Latin-based languages in degraded and distorted document images. Unlike the reported methods tha...
The goal of the DARPA MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) Program is to automatically convert foreign language text images into Englis...
Currently, an abundance of historical manuscripts, journals, and scientific notes remains largely inaccessible in library archives. Manual transcription and publication of such docu...
We argue that in general, the analysis of lexical cohesion factors in a document can drive a summarizer, as well as enable other content characterization tasks. More narrowly, thi...