Sciweavers

103 search results - page 5 / 21
Models and Algorithms for Duplicate Document Detection
SIGIR
2008
ACM
13 years 7 months ago
Local text reuse detection
Text reuse occurs in many different types of documents and for many different reasons. One form of reuse, duplicate or near-duplicate documents, has been a focus of researchers be...
Jangwon Seo, W. Bruce Croft
SIGIR
2000
ACM
13 years 11 months ago
An investigation of linguistic features and clustering algorithms for topical document clustering
We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase he...
Vasileios Hatzivassiloglou, Luis Gravano, Ankineed...
ICPR
2008
IEEE
14 years 1 month ago
A robust front page detection algorithm for large periodical collections
Large-scale digitization projects aimed at periodicals often have as input streams of completely unlabeled document images. In such situations, the results produced by the automat...
Iuliu Vasile Konya, Christoph Seibert, Sebastian G...
MSR
2011
ACM
12 years 10 months ago
Modeling the evolution of topics in source code histories
Studying the evolution of topics (collections of co-occurring words) in a software project is an emerging technique to automatically shed light on how the project is changing over...
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, Do...
ICDAR
2009
IEEE
13 years 5 months ago
Clutter Noise Removal in Binary Document Images
The paper presents a clutter detection and removal algorithm for complex document images. The distance transform based approach is independent of clutter's position, size, sh...
Mudit Agrawal, David S. Doermann