We introduce and validate bootstrap techniques to compute confidence intervals that quantify the effect of test-collection variability on the average precision (AP) and mean average precision (MAP) IR effectiveness measures. We consider the test collection in IR evaluation to be representative of a population of materially similar collections, whose documents are drawn from an infinite pool with similar characteristics. Our model accurately predicts the degree of concordance between system results on randomly selected halves of the TREC-6 ad hoc corpus. We advance a framework for statistical evaluation that uses the same general model to account for other sources of chance variation, providing input for meta-analysis techniques.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Systems and Software – performance evaluation

General Terms
Experimentation, Measurement

Keywords
bootstrap, confidence interval, precision
Gordon V. Cormack, Thomas R. Lynam
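To illustrate the general idea of a bootstrap confidence interval for MAP, here is a minimal sketch using a generic percentile bootstrap over per-topic AP scores. This is an assumption-laden illustration of the technique, not necessarily the authors' exact resampling procedure (which models collection variability); the function name and parameters are hypothetical.

```python
import random

def bootstrap_map_ci(ap_scores, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for MAP.

    ap_scores: per-topic average precision values for one system.
    Resampling with replacement treats the observed topics as a
    sample from a population of materially similar evaluations.
    NOTE: illustrative sketch only, not the paper's exact method.
    """
    rng = random.Random(seed)
    n = len(ap_scores)
    # Recompute MAP on each bootstrap resample, then sort.
    maps = sorted(
        sum(rng.choice(ap_scores) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles.
    lo = maps[int((alpha / 2) * n_boot)]
    hi = maps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

With identical per-topic scores the interval collapses to a point; with varied scores it brackets the observed MAP.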