Abstract. This paper investigates how performance measures and relevance functions affect the comparison of retrieval systems in INEX, an evaluation forum dedicated to XML retrieval. We focus on two interdependent challenges that arise when evaluating XML retrieval systems: the weak ordering of retrieved lists (i.e., ties among retrieved items) and multivalued relevance scales. Our analysis provides empirical evidence on the reasonableness of two popular assumptions in information retrieval (IR) evaluation, namely that ties can be ignored and that binary relevance is sufficient. We also shed light on how a parameter of Q-measure [18] affects the sensitivity of the metric.