Sciweavers

290 search results - page 49 / 58
» Document normalization revisited
CIKM
2011
Springer
12 years 7 months ago
Probabilistic near-duplicate detection using simhash
This paper offers a novel look at using a dimensionality-reduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algo...
Sadhan Sood, Dmitri Loguinov
CACM
2006
13 years 7 months ago
Infoglut
whose titles and abstracts sound very interesting, the pile of unread reports continues to grow on the table in my office." (How quaint the terminology: mail and electronic me...
Peter J. Denning
SIGIR
2002
ACM
13 years 7 months ago
Empirical studies in strategies for Arabic retrieval
This work evaluates a few search strategies for Arabic monolingual and cross-lingual retrieval, using the TREC Arabic corpus as the test-bed. The release by NIST in 2001 of an Ara...
Jinxi Xu, Alexander Fraser, Ralph M. Weischedel
SIGIR
2010
ACM
13 years 11 months ago
Estimation of statistical translation models based on mutual information for ad hoc information retrieval
As a principled approach to capturing semantic relations of words in information retrieval, statistical translation models have been shown to outperform simple document language m...
Maryam Karimzadehgan, ChengXiang Zhai
DRR
2003
13 years 8 months ago
Correcting OCR text by association with historical datasets
The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting ...
Susan E. Hauser, Jonathan Schlaifer, Tehseen F. Sa...