We propose a new technique for clustering of text documents that relies on a biclustering structure constructed on terms and documents. Our approach makes use of a greedy algorith...
For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information...
Hierarchical categorization of documents is a task receiving growing interest due to the widespread proliferation of topic hierarchies for text documents. The worst problem of hie...
Word searching and indexing in historical document collections is a challenging problem because, characters in these documents are often touching or broken due to degradation/agei...
A vast amount of documents in the Web have duplicates, which is a challenge for developing efficient methods that would compute clusters of similar documents. In this paper we use ...