The field of psychometrics routinely grapples with the question of what it means to measure the inherent ability of an organism to perform a given task, and for the last forty years the field has increasingly relied on probabilistic methods such as the Rasch model for test construction and the analysis of test results. Because the underlying issues of measuring ability apply to human language technologies as well, such probabilistic methods can be advantageously applied to the evaluation of those technologies. To test this claim, Rasch measurement was applied to the results of 67 systems participating in the Question Answering track of the 2002 Text REtrieval Conference (TREC) competition. Satisfactory model fit was obtained, and the paper illustrates the theoretical and practical strengths of Rasch scaling for evaluating systems as well as questions. Most importantly, simulations indicate that a test-invariant metric can be defined by carrying forward 20 to 50 equating questions, thus...
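The dichotomous Rasch model referenced above can be illustrated with a minimal sketch: the probability that a system of ability θ answers a question of difficulty b correctly is exp(θ − b) / (1 + exp(θ − b)). This is the standard textbook form, not the paper's own implementation; the function name `rasch_p` is ours.

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """Dichotomous Rasch model: probability that a system with
    ability theta answers a question of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the success probability is exactly 0.5.
print(rasch_p(0.0, 0.0))  # 0.5

# A stronger system (higher theta) has a higher chance on the same question,
# which is what makes the scale comparable across systems and questions.
print(rasch_p(1.0, 0.0) > rasch_p(0.0, 0.0))  # True
```

Because ability and difficulty sit on the same logit scale, a shared set of equating questions (the 20 to 50 mentioned above) suffices to link scores across different test forms.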
Rense Lange, Juan Moran, Warren R. Greiff, Lisa Fe