Sciweavers

77 search results - page 5 / 16
» Pairwise Document Similarity in Large Collections with MapRe...
ICCS
2009
Springer
Frequent Itemset Mining for Clustering Near Duplicate Web Documents
A vast amount of documents in the Web have duplicates, which is a challenge for developing efficient methods that would compute clusters of similar documents. In this paper we use ...
Dmitry I. Ignatov, Sergei O. Kuznetsov
ICDE
2004
IEEE
Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks
We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of impo...
Torsten Suel, Patrick Noel, Dimitre Trendafilov
SIGIR
2008
ACM
SpotSigs: robust and efficient near duplicate detection in large web collections
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
DIAL
2006
IEEE
Tree clustering for layout-based document image retrieval
We describe a system for the retrieval on the basis of layout similarity of document images belonging to collections stored in digital libraries. Layout regions are extracted and ...
Simone Marinai, Emanuele Marino, Giovanni Soda
BMCBI
2008
Towards an automatic classification of protein structural domains based on structural similarity
Background: Formal classification of a large collection of protein structures aids the understanding of evolutionary relationships among them. Classifications involving manual ste...
Vichetra Sam, Chin-Hsien Tai, Jean Garnier, Jean-F...