A Topic-Based Measure of Resource Description Quality for Distributed Information Retrieval

16 years 3 months ago


The aim of query-based sampling is to obtain a sufficient, representative sample of an underlying (text) collection. Current measures for assessing sample quality are too coarse-grained to be informative. This paper outlines a measure of finer granularity based on probabilistic topic models of text. The assumption we make is that a representative sample should capture the broad themes of the underlying text collection. If these themes are not captured, then resource selection will be affected in terms of performance, coverage and reliability. For example, resource selection algorithms that extrapolate from a small sample of indexed documents to determine which collections are most likely to hold relevant documents may be affected by samples that do not reflect the topical density of a collection. To address this issue we propose to measure the relative entropy between topics obtained in a sample with respect to the complete collection. Topics are both modelled from the col...
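As a rough illustration of the relative-entropy idea, the sketch below compares a topic distribution estimated from a sample against one estimated from the full collection. It is not the paper's implementation: the function name `kl_divergence`, the direction of the divergence (collection relative to sample), the smoothing constant `eps`, and the toy topic proportions are all assumptions made for this example, and the topic distributions are taken as given rather than inferred by a topic model.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q) in bits between two discrete distributions.

    p and q are sequences of non-negative weights over the same topics.
    Both are normalised to sum to 1; q is floored at eps (an assumed
    smoothing choice) so that a topic missing from q does not cause a
    division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    p_total = sum(p)
    q_total = sum(q)
    divergence = 0.0
    for p_i, q_i in zip(p, q):
        p_i = p_i / p_total
        q_i = max(q_i / q_total, eps)
        if p_i > 0:  # terms with p_i == 0 contribute nothing
            divergence += p_i * math.log2(p_i / q_i)
    return divergence


# Hypothetical topic proportions, for illustration only.
collection_topics = [0.50, 0.30, 0.20]
representative_sample = [0.48, 0.32, 0.20]
skewed_sample = [0.90, 0.05, 0.05]

good = kl_divergence(collection_topics, representative_sample)
bad = kl_divergence(collection_topics, skewed_sample)
print(f"representative sample: {good:.4f} bits")
print(f"skewed sample:         {bad:.4f} bits")
```

Under these assumptions, a sample whose topic proportions track the collection yields a divergence near zero, while a sample dominated by one theme yields a larger value, which is the kind of finer-grained signal the abstract argues for.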

Mark Baillie, Mark James Carman, Fabio Crestani
