Sciweavers

48 search results - page 4 / 10
» Collection statistics for fast duplicate document detection
KDD
2004
ACM
Improved robustness of signature-based near-replica detection via lexicon randomization
Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional d...
Aleksander Kolcz, Abdur Chowdhury, Joshua Alspecto...
DEXAW
1999
IEEE
Document Analysis Techniques for the Infinite Memory Multifunction Machine
A system that saves a digital copy of every document that users copy, print, or fax, without asking the user, has recently been proposed. Referred to as the Infinite Memory Multif...
Jonathan J. Hull, Dar-Shyang Lee, John F. Cullen, ...
SIGIR
2006
ACM
Near-duplicate detection by instance-level constrained clustering
For the task of near-duplicated document detection, both traditional fingerprinting techniques used in database community and bag-of-word comparison approaches used in information...
Hui Yang, James P. Callan
DAS
2008
Springer
A Fast Preprocessing Method for Table Boundary Detection: Narrowing Down the Sparse Lines Using Solely Coordinate Information
With the rapid growth of PDF documents in digital libraries, recognizing the document structure and detecting specific document components are useful for document storage, classifica...
Ying Liu, Prasenjit Mitra, C. Lee Giles
DGO
2006
Next steps in near-duplicate detection for eRulemaking
Large volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents, which increases processing and sto...
Hui Yang, Jamie Callan, Stuart W. Shulman