Relevance assessment: are judges exchangeable and does it matter

We investigate to what extent people making relevance judgements for a reusable IR test collection are exchangeable. We consider three classes of judge: "gold standard" judges, who are topic originators and are experts in a particular information seeking task; "silver standard" judges, who are task experts but did not create topics; and "bronze standard" judges, who are those who did not define topics and are not experts in the task. Analysis shows low levels of agreement in relevance judgements between these three groups. We report on experiments to determine if this is sufficient to invalidate the use of a test collection for measuring system performance when relevance assessments have been created by silver standard or bronze standard judges. We find that both system scores and system rankings are subject to consistent but small differences across the three assessment sets. It appears that test collections are not completely robust to changes of judge ...
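The two comparisons described above (agreement between judge groups, and stability of system rankings across assessment sets) can be illustrated with a minimal sketch. The abstract does not say which statistics the authors used, so Cohen's kappa and Kendall's tau are assumptions here, and all labels, system names, and scores are hypothetical.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary relevance labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each judge's marginal rate of "relevant" labels.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)


def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between two score dicts over the same systems (no tie correction)."""
    concordant = discordant = 0
    for s, t in combinations(sorted(scores_x), 2):
        sign = (scores_x[s] - scores_x[t]) * (scores_y[s] - scores_y[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(scores_x) * (len(scores_x) - 1) / 2
    return (concordant - discordant) / pairs


# Hypothetical labels for eight documents on one topic (1 = relevant).
gold = [1, 1, 0, 0, 1, 0, 1, 0]
bronze = [1, 0, 0, 1, 1, 0, 0, 0]

# Hypothetical effectiveness scores for four systems under each judgement set.
scores_gold = {"sysA": 0.42, "sysB": 0.38, "sysC": 0.31, "sysD": 0.25}
scores_bronze = {"sysA": 0.36, "sysB": 0.37, "sysC": 0.28, "sysD": 0.22}

kappa = cohen_kappa(gold, bronze)
tau = kendall_tau(scores_gold, scores_bronze)
print(f"kappa={kappa:.2f} tau={tau:.2f}")  # prints "kappa=0.25 tau=0.67"
```

In this toy data the judges agree only modestly (kappa 0.25), yet a single swapped pair of systems still leaves the rankings fairly well correlated (tau 0.67), which is the kind of contrast the abstract reports.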

Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries, Emine Yilmaz

Information Technology | Relevance Judgements | SIGIR 2008 | Standard | Test Collections

Post Info

Added	15 Dec 2010
Updated	15 Dec 2010
Type	Conference
Year	2008
Where	SIGIR
Authors	Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries, Emine Yilmaz
