Question answering systems increasingly need to handle complex information needs that require more than simple factoid answers. The evaluation of such systems is usually carried out using precision- or recall-based system performance metrics. Previous work has demonstrated that when users are shown two search result lists side-by-side, they can reliably distinguish the relative quality of the lists. We investigate the consistency between this user-based approach and system-oriented metrics in the question answering environment. Our initial results indicate that the two methodologies show a high level of disagreement.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software --- Performance evaluation

General Terms
Experimentation, Human Factors, Performance

Keywords
TREC, ciQA, human preference and judgement