Sciweavers

» Entity categorization over large document collections
SIGIR
2008
ACM
13 years 7 months ago
SpotSigs: robust and efficient near duplicate detection in large web collections
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
IJCNN
2007
IEEE
14 years 1 month ago
Text Representations for Text Categorization: A Case Study in Biomedical Domain
In vector space model (VSM), textual documents are represented as vectors in the term space. Therefore, there are two issues in this representation, i.e. (1) what should a term...
Man Lan, Chew Lim Tan, Jian Su, Hwee-Boon Low
SIGIR
2004
ACM
14 years 1 month ago
Parameterized generation of labeled datasets for text categorization based on a hierarchical directory
Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (...
Dmitry Davidov, Evgeniy Gabrilovich, Shaul Markovi...
COLING
2010
13 years 2 months ago
Large Scale Parallel Document Mining for Machine Translation
A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an init...
Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe ...
EDBT
2004
ACM
14 years 7 months ago
HOPI: An Efficient Connection Index for Complex XML Document Collections
In this paper we present HOPI, a new connection index for XML documents based on the concept of the 2-hop cover of a directed graph introduced by Cohen et al. In contrast to most o...
Ralf Schenkel, Anja Theobald, Gerhard Weikum