We report an improved methodology for training a sequence of classifiers for document image content extraction, that is, the location and segmentation of regions containing handwr...
Patent document images maintained by the U.S. patent database have a specific format, in which figures and text descriptions are separated into different sections. This makes it d...
HTML has popularized the use of style sheets, and the advent of XML has stressed the importance of style as a key area complementing document structure and content. A number of to...
Separating machine-printed text and handwriting from overlapping text is a challenging problem in the document analysis field, and no reliable algorithms have been developed thus f...
Automated extraction of bibliographic information from journal articles is key to the affordable creation and maintenance of citation databases, such as MEDLINE.
Xiaoli Zhang, Jie Zou, Daniel X. Le, George R. Tho...
Information Retrieval systems are limited by the linguistic variation of language. The use of Natural Language Processing techniques to manage this problem has been studied for a ...
This paper presents an iterative method for generative semantic clustering of related information elements in spatial hypertext documents. The goal is to automatically organize th...
Andruid Kerne, Eunyee Koh, Vikram Sundaram, J. Mic...
Document representations can rapidly become unwieldy if they try to encapsulate all possible document properties, ranging from abstract structure to detailed rendering and layout. We pres...