The structural features of XML components are an extra source of information that should be used in a contentoriented retrieval task on this type of documents. This paper explores...
This paper considers the problem of identifying on the Web compound documents (cDocs) ? groups of web pages that in aggregate constitute semantically coherent information entities...
Successful applications of digital libraries require structured access to sources of information. This paper presents an approach to extract the logical structure of text document...
While the information resources on the Web are vast, the sources are often hard to find, painful to use, and difficult to integrate. We have developed the Heracles framework for b...
In this paper we introduce a programming language for Web document processing called WebL. WebL is a high level, object-oriented scripting language that incorporates two novel fea...