This paper explores factors correlating with lack of inter-annotator agreement on a word sense disambiguation (WSD) task taken from SENSEVAL-2. Twenty-seven subjects were given a series of tasks requiring word sense judgments, in which they judged the applicability of word senses to polysemous words used in context. Metrics of lexical ability were evaluated as predictors of agreement between judges. A strong interaction effect was found for lexical ability: differences between judges' levels of lexical knowledge predicted disagreement. Individual levels of lexical knowledge, however, were not independently predictive of disagreement. This finding runs counter to previous assumptions regarding expert agreement on WSD annotation tasks and, in turn, calls into question what constitutes a meaningful ``gold standard'' for system evaluation.
G. Craig Murray, Rebecca Green