A distributed system is described that reliably mines parallel text from large corpora. The approach can be regarded as cross-language near-duplicate detection, enabled by an init...
Jakob Uszkoreit, Jay Ponte, Ashok C. Popat, Moshe ...
The main approaches to corpus-based semantic class mining include distributional similarity (DS) and pattern-based (PB) methods. In this paper, we perform an empirical comparison of them, bas...
This paper presents an application of PageRank, a random-walk model originally devised for ranking Web search results, to ranking WordNet synsets in terms of how strongly they pos...
We propose a data-mining approach that produces generalized query patterns (with generalized keywords) from the raw user logs of the Microsoft Encarta search engine (http://encart...
Charles X. Ling, Jianfeng Gao, Huajie Zhang, Weini...
User clicks on a URL in response to a query are extremely useful predictors of the URL's relevance to that query. Exact match click features tend to suffer from severe data s...
Huihsin Tseng, Longbin Chen, Fan Li, Ziming Zhuang...