Javed A. Aslam, Emine Yilmaz

ABSTRACT

We consider the problem of evaluating retrieval systems using a limited number of relevance judgments. Recent work has demonstrated that one can accurately estimate average precision via a judged pool corresponding to a relatively small random sample of documents. In this work, we demonstrate that given values or estimates of average precision, one can accurately infer the relevances of unjudged documents. Combined, these results show how one can efficiently and accurately infer a large judged pool from a relatively small number of judged documents, permitting accurate and efficient retrieval evaluation on a large scale.

Categories and Subject Descriptors

H.3.4 [Information Storage and Retrieval]: Systems and Software – Performance evaluation

General Terms

Theory, Measurement, Experimentation

Keywords

Relevance Judgments, Average Precision
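To make the first ingredient of the abstract concrete, the sketch below illustrates estimating average precision from a small random sample of judged documents. This is a generic inverse-probability-weighted estimator, not the authors' actual method; the document ids, sampling rate, and weighting scheme are all illustrative assumptions. With a sampling rate of 1 (the full pool judged) it reduces to exact average precision.

```python
import random

def average_precision(ranking, relevant):
    """Exact AP for a ranked list given complete relevance judgments."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranking, 1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def sampled_ap(ranking, judged, p):
    """Estimate AP from a uniform random sample of judged documents.

    judged maps doc id -> relevance (True/False) for sampled docs only;
    p is the sampling rate. Illustrative only -- NOT the estimator from
    the paper. Relevant documents above each rank are counted with
    inverse-probability weight 1/p; with p = 1 this is exact AP.
    """
    est = 0.0
    rel_above = 0.0   # IPW estimate of relevant docs above the current rank
    sampled_rel = 0   # number of sampled documents judged relevant
    for k, doc in enumerate(ranking, 1):
        if judged.get(doc):
            sampled_rel += 1
            est += (1.0 + rel_above) / k  # estimated precision at rank k
            rel_above += 1.0 / p          # each sampled rel doc stands for 1/p docs
    return est / sampled_rel if sampled_rel else 0.0

# Usage: judge a ~40% random sample of the pool instead of all 100 documents.
random.seed(0)
ranking = [f"d{i}" for i in range(100)]
relevant = {f"d{i}" for i in range(0, 100, 7)}
p = 0.4
judged = {d: (d in relevant) for d in ranking if random.random() < p}
print(average_precision(ranking, relevant), sampled_ap(ranking, judged, p))
```

The second ingredient, inferring relevances of unjudged documents from known AP values, works in the opposite direction and is the contribution developed in the body of the paper.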