Unlike familiar clustering objects, text documents have sparse data spaces. A common way of representing a document is as a bag of its component words, but the semantic re...
XML documents are frequently used in applications such as business transactions and medical records involving sensitive information. Typically, parts of documents should be visibl...
Naizhen Qi, Michiharu Kudo, Jussi Myllymaki, Hamid...
With the increasing popularity of semi-structured documents (particularly in the form of XML) for knowledge management, it is important to create tools that use the additional in...
Knowledge workers use paper extensively for document reviewing and note-taking due to its versatility and simplicity of use. As users annotate printed documents and gather notes, ...
Context influences the search process, but to date research has not definitively identified which aspects of context are the most influential for information retrieval, and thus a...
Luanne Freund, Elaine G. Toms, Charles L. A. Clark...
Page segmentation algorithms found in the published literature often rely on predetermined parameters such as general font sizes, distances between text lines and document scan ...
We describe a new corpus collected for comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available to academic groups for academic use. The major ...
Stoyan Mihov, Klaus U. Schulz, Christoph Ringlstet...
This paper describes the development of a new document ranking system based on layout similarity. The user has a need represented by a set of "wanted" documents, and the syste...
May Huang, Daniel DeMenthon, David S. Doermann, Ly...
We present a document understanding system in which the arrangement of lines of text and block separators within a document is modeled by stochastic context-free grammars. A gram...
John C. Handley, Anoop M. Namboodiri, Richard Zani...
We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a glo...