Large scale learning is often realistic only in a semi-supervised setting where a small set of labeled examples is available together with a large collection of unlabeled data. In...
Modern distributed information retrieval techniques require accurate knowledge of collection size. In non-cooperative environments, where detailed collection statistics are not av...
Text message stream is a newly emerging type of Web data which is produced in enormous quantities with the popularity of Instant Messaging and Internet Relay Chat. It is benefici...
The performance of document clustering systems depends on employing optimal text representations, which are not only difficult to determine beforehand, but also may vary from one ...
Traditional web link-based ranking schemes use a single score to measure a page’s authority without concern of the community from which that authority is derived. As a result, a...
We present a novel language modeling approach to capturing the query reformulation behavior of Web search users. Based on a framework that categorizes eight different types of “...
Large-scale web and text retrieval systems deal with amounts of data that greatly exceed the capacity of any single machine. To handle the necessary data volumes and query through...
Similarity measures for text have historically been an important tool for solving information retrieval problems. In many interesting settings, however, documents are often closel...