String transformation systems have been introduced in (Brill, 1995) and have several applications in natural language processing. In this work we consider the computational proble...
We revisit a problem introduced by Bharat and Broder almost a decade ago: how to sample random pages from the corpus of documents indexed by a search engine, using only the search...
This paper describes the participation of Columbus Project of Microsoft Research Asia (MSRA) in the GeoCLEF 2006 (a cross-language geographical retrieval track which is part of Cr...
Zhisheng Li, Chong Wang 0002, Xing Xie, Xufa Wang,...
We address the problem of measuring global quality metrics of search engines, like corpus size, index freshness, and density of duplicates in the corpus. The recently proposed est...
Abstract. Clustering is often considered the most important unsupervised learning problem and several clustering algorithms have been proposed over the years. Many of these algorit...