Further Meta-Evaluation of Broad-Coverage Surface Realization


We present the first evaluation of the utility of automatic evaluation metrics on surface realizations of Penn Treebank data. Using outputs of the OpenCCG and XLE realizers, along with ranked WordNet synonym substitutions, we collected a corpus of generated surface realizations. These outputs were then rated and post-edited by human annotators. We evaluated the realizations using seven automatic metrics, and analyzed correlations obtained between the human judgments and the automatic scores. In contrast to previous NLG meta-evaluations, we find that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best overall. We also find that all of the metrics correctly predict more than half of the significant system-level differences, though none are correct in all cases. We conclude with a discussion of the implications for the utility of such metrics in evaluating generation in the presence of variation. A further...

Dominic Espinosa, Rajakrishnan Rajkumar, Michael W


Automatic Evaluation Metrics | EMNLP 2010 | Human Judgments | Natural Language Processing | Surface Realizations


Post Info

Added	11 Feb 2011
Updated	11 Feb 2011
Type	Journal
Year	2010
Where	EMNLP
Authors	Dominic Espinosa, Rajakrishnan Rajkumar, Michael White, Shoshana Berleant
