Crawlers in a knowledge management system need to collect and archive documents from websites, and also track the change status of these documents. However, the existence of URL r...
Defining the boundaries of a web-site, for (say) archiving or information retrieval purposes, is an important but complicated task. In this paper a web-page clustering approach to...
We approached this line of inquiry by questioning the conventional wisdom that audit logs are too large to be analyzed and must be reduced and filtered before the data can be anal...
Data deduplication has become a popular technology for reducing the amount of storage space necessary for backup and archival data. Content defined chunking (CDC) techniques are w...
The provenance of data has recently been recognized as central to the trust one places in data. It is also important to annotation, to data integration and to probabilistic databa...