Evaluation of sentiment analysis, like large-scale IR evaluation, relies on human assessors to produce accurate judgments. Subjectivity in judgments is a problem for relevance assessment, and even more so for sentiment annotation. In this study we examine the degree to which assessors agree on sentence-level sentiment annotations. We show that inter-assessor agreement is not contingent on document length or on the frequency of sentiment, but correlates positively with automated opinion retrieval performance. We also examine the individual annotation categories to determine which pose the most difficulty for annotators.

Categories and Subject Descriptors: H.3.4 [Information Retrieval]: Systems and Software - Performance evaluation (efficiency and effectiveness)

General Terms: Experimentation, Measurement, Human Factors
Adam Bermingham, Alan F. Smeaton