The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many system...
---- The traditional web caching is currently limited to static documents only. A page generated on the fly from a server side script may have different contents on different acces...
This paper discusses generating document structure from annotated media repositories in a domain-independent manner. This approaches the vision of a universal RDF browser. We star...
Lloyd Rutledge, Jacco van Ossenbruggen, Lynda Hard...
We study the problem of creating highly compressed fulltext index structures for versioned document collections, that is, collections that contain multiple versions of each docume...
In this paper, we present an analysis based on linguistic and typographic features that allows for the identification of titles in web documents. We focus in particular on procedu...