Information retrieval systems are evaluated against test collections of topics, documents, and assessments of which documents are relevant to which topics. Documents are chosen for relevance assessment by pooling runs from a set of existing systems. New systems can return unassessed documents, leading to an evaluation bias against them. In this paper, we propose to estimate the degree of bias against an unpooled system, and to adjust the system's score accordingly. Bias estimation can be done via leave-one-out experiments on the existing, pooled systems, but this requires the problematic assumption that the new system is similar to the existing ones. Instead, we propose that all systems, new and pooled, be fully assessed against a common set of topics, and that the bias observed against the new system on the common topics be used to adjust its scores on the existing topics. We demonstrate, using resampling experiments on TREC test sets, that our method leads to a marked reduction in error, eve...
William Webber, Laurence A. F. Park
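
The abstract describes using the bias observed on a small set of fully assessed common topics to adjust the new system's scores on the existing, pool-assessed topics. The sketch below illustrates one simple way this could be realised, assuming an additive correction equal to the mean per-topic score drop under pooled versus full assessment; the function names, example scores, and the additive form of the estimator are illustrative assumptions, not necessarily the estimator used in the paper.

```python
"""Sketch of common-topic bias adjustment for an unpooled system.

Assumptions (not taken from the paper itself): scores are per-topic
effectiveness values (e.g. AP), and the bias is modelled as an additive
constant estimated on the common topics.
"""
from statistics import mean


def estimate_bias(pooled_scores_common, full_scores_common):
    """Estimate pooling bias against the new system as the mean loss in its
    per-topic score when evaluated with pooled rather than full assessments,
    measured on the commonly (fully) assessed topics."""
    assert len(pooled_scores_common) == len(full_scores_common)
    return mean(f - p for p, f in zip(pooled_scores_common, full_scores_common))


def adjust_scores(pooled_scores_existing, bias):
    """Apply the bias estimated on the common topics to the new system's
    pooled-assessment scores on the existing topics."""
    return [s + bias for s in pooled_scores_existing]


if __name__ == "__main__":
    # Hypothetical per-topic scores for the new (unpooled) system.
    pooled_common = [0.21, 0.34, 0.18]    # common topics, pooled qrels only
    full_common = [0.27, 0.38, 0.25]      # common topics, fully assessed
    pooled_existing = [0.30, 0.22, 0.41]  # existing topics, pooled qrels

    bias = estimate_bias(pooled_common, full_common)
    print("Estimated bias:", round(bias, 3))
    print("Adjusted scores:",
          [round(s, 3) for s in adjust_scores(pooled_existing, bias)])
```

In this reading, the common topics serve only to calibrate how much the pooled assessments understate the new system's effectiveness; the corrected scores on the existing topics are then comparable with those of the pooled systems.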