Comparing Rating Scales and Preference Judgements in Language Evaluation


Rating-scale evaluations are common in NLP, but are problematic for a range of reasons, e.g. they can be unintuitive for evaluators, inter-evaluator agreement and self-consistency tend to be low, and the parametric statistics commonly applied to the results are not generally considered appropriate for ordinal data. In this paper, we compare rating scales with an alternative evaluation paradigm, preference-strength judgement experiments (PJEs), where evaluators have the simpler task of deciding which of two texts is better in terms of a given quality criterion. We present three pairs of evaluation experiments assessing text fluency and clarity for different data sets, where one of each pair of experiments is a rating-scale experiment, and the other is a PJE. We find the PJE versions of the experiments have better evaluator self-consistency and inter-evaluator agreement, and a larger proportion of variation accounted for by system differences, resulting in a larger number of significant d...
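
As a rough illustration of why pairwise preferences suit non-parametric analysis, the sketch below runs an exact two-sided sign test on counts of "which of two texts is better" judgements. This is not the analysis used in the paper: it assumes only the direction of each preference is kept (strength is ignored), that ties are dropped beforehand, and the function name `sign_test_p_value` and the example counts are hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test for paired preference counts.

    wins_a / wins_b: number of judgements preferring system A / system B.
    Ties are assumed to have been discarded before counting.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no evidence either way
    k = min(wins_a, wins_b)
    # Probability, under H0 (each system preferred with p = 0.5),
    # of a split at least as lopsided as the observed one in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# Hypothetical counts, made up for illustration only.
print(sign_test_p_value(8, 2))
```

The test uses only the ordering information in each judgement, so it makes no interval-scale assumption about the underlying ratings.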

Anja Belz, Eric Kow


Alternative Evaluation Paradigm | INLG 2010 | Inter-evaluator Agreement | Natural Language Processing | Ordinal Data



Post Info

Added	13 Feb 2011
Updated	13 Feb 2011
Type	Journal
Year	2010
Where	INLG
Authors	Anja Belz, Eric Kow

