In this paper, we compare the effectiveness scores and system rankings obtained with the inex-2002 metric, the official measure of INEX 2004, and the XCG metrics proposed in [4] and further developed here. For the comparisons, we use simulated runs, since for these we can easily derive, from a predefined set of user preferences, the system rankings that a reliable measure should produce. The results indicate that the XCG metrics are better suited for comparing systems on the INEX content-only (CO) task, where systems aim to return the highest-scoring elements according to the user preferences reflected in a quantisation function, while also avoiding the return of overlapping components.
Gabriella Kazai, Mounia Lalmas, Arjen P. de Vries
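For concreteness, the following is a minimal sketch, not taken from the paper, of what a quantisation function looks like in the INEX setting: it maps an element's (exhaustivity, specificity) assessment pair, each judged on a 0-3 scale, to a single relevance score that a metric can accumulate. The strict variant shown here credits only fully exhaustive and fully specific elements; the function name `quantise_strict` and the small illustrative run are our own.

```python
# Illustrative sketch (not from the paper): a quantisation function maps an
# element's (exhaustivity, specificity) assessment, each on INEX's 0-3 scale,
# to a single relevance score.

def quantise_strict(exhaustivity: int, specificity: int) -> float:
    """Strict quantisation: only highly exhaustive AND highly specific
    elements (e = 3, s = 3) count as relevant."""
    return 1.0 if exhaustivity == 3 and specificity == 3 else 0.0

# Hypothetical ranked run of (element path, exhaustivity, specificity) triples.
# Note that the first two elements overlap (one contains the other), which is
# exactly the kind of redundancy the CO task penalises.
run = [("article[1]/sec[2]", 3, 3),        # fully relevant
       ("article[1]", 3, 1),               # exhaustive but barely specific
       ("article[1]/sec[2]/p[1]", 2, 3)]   # specific but not fully exhaustive

scores = [quantise_strict(e, s) for _, e, s in run]
print(scores)  # [1.0, 0.0, 0.0]
```

Other quantisation functions assign graded credit to partially exhaustive or partially specific elements; swapping in such a function changes the user preference that the metric rewards, which is what the simulated-run comparison exploits.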