Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus, it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and it empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.

Metric-based evaluations of human language technology, such as MUC, TREC, and DUC, continue to proliferate (Sparck Jones, 2001). This proliferation is not difficult to understand: evaluations can forge communities, accelerate technology transfer, and advance the state of the art. Yet evaluations are not without ...
Ellen M. Voorhees