We consider the problem of evaluating retrieval systems using incomplete judgment information. Buckley and Voorhees recently demonstrated that retrieval systems can be efficiently and effectively evaluated using incomplete judgments via the bpref measure [6]. When relevance judgments are complete, bpref approximates the value of average precision computed with those complete judgments. When judgments are incomplete, however, bpref deviates from this value, though it continues to rank systems in a manner similar to average precision evaluated with a complete judgment set. In this work, we propose three evaluation measures that (1) approximate average precision even when the relevance judgments are incomplete and (2) are more robust to incomplete and imperfect relevance judgments than bpref. The proposed estimates of average precision are simple and accurate, and we demonstrate their utility using TREC data.
Emine Yilmaz, Javed A. Aslam
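
For context on the two measures named above, the following is a minimal Python sketch, assuming the standard definition of average precision and one common formulation of bpref; the function names, document ids, and judgment sets are illustrative assumptions and are not taken from the paper.

    # Illustrative sketch (not the paper's code): average precision and one
    # common formulation of bpref, computed from a ranked list of document ids
    # and sets of judged-relevant / judged-nonrelevant documents.

    def average_precision(ranking, relevant):
        """AP = mean of precision-at-rank over the relevant retrieved documents."""
        if not relevant:
            return 0.0
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant)

    def bpref(ranking, relevant, nonrelevant):
        """bpref penalizes each judged-relevant document by the fraction of
        judged-nonrelevant documents ranked above it; unjudged docs are ignored."""
        R, N = len(relevant), len(nonrelevant)
        if R == 0:
            return 0.0
        denom = min(R, N) if min(R, N) > 0 else 1
        nonrel_seen, total = 0, 0.0
        for doc in ranking:
            if doc in nonrelevant:
                nonrel_seen += 1
            elif doc in relevant:
                total += 1.0 - min(nonrel_seen, denom) / denom
        return total / R

    # Hypothetical example: with complete judgments the two measures track each
    # other; removing judgments changes bpref's value while, per the abstract,
    # tending to preserve the relative ordering of systems.
    ranking = ["d3", "d1", "d7", "d2", "d5"]
    relevant = {"d1", "d2"}
    nonrelevant = {"d3", "d7"}
    print(average_precision(ranking, relevant))   # (1/2 + 2/4) / 2 = 0.5
    print(bpref(ranking, relevant, nonrelevant))  # ((1 - 1/2) + (1 - 2/2)) / 2 = 0.25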