To achieve high efficiency, several IR tasks rely on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the docu...
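The abstract above is truncated, so the following is only a minimal sketch of the general idea of an inverted index (a map from each term to the documents that contain it), not the structure or compression scheme the paper itself studies. The function names `build_inverted_index` and `search_all`, the whitespace tokenizer, and the three sample documents are illustrative assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it.

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample collection, used only for illustration.
docs = [
    "the inverted index maps terms to documents",
    "latent semantic indexing for text retrieval",
    "web documents and text collections",
]
index = build_inverted_index(docs)
print(index["text"])
print(search_all(index, "text documents"))
```

Keeping each posting list sorted is a common design choice because it allows lists to be intersected by merging and stored compactly as gaps between ids; the sketch uses set intersection only for brevity.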
Abstract. In this paper we describe a methodology for harvesting information from large distributed repositories (e.g. large Web sites) with minimum user intervention. The methodol...
Fabio Ciravegna, Sam Chapman, Alexiei Dingli, Yori...
In recent years, Latent Semantic Indexing (LSI) has been recognized as an effective tool for Information Retrieval in text documents. The level of "granularity" in LSI (...
This paper considers the problem of identifying compound documents (cDocs) on the Web: groups of web pages that in aggregate constitute semantically coherent information entities...
While several hierarchical classification methods have been applied to web content, such techniques invariably rely on a pre-defined taxonomy of documents. We propose a new techni...