XML documents are extremely verbose since the "schema" is repeated for every "record" in the document. While a variety of compressors are available to address ...
This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised pr...
Patent document images maintained by the U.S. patent database have a specific format, in which figures and text descriptions are separated into different sections. This makes it...
Abstract-- Current twig join algorithms incur high memory costs on queries that involve child-axis nodes. In this paper we provide an analytical explanation for this phenomenon. In...
Degraded documents are frequently obtained in various situations. Examples of degraded document collections include historical document depositories, document obtained in legal an...