Sciweavers

ACL
2011

Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don’t do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data an...
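The co-training step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual procedure, so this is generic co-training under stated assumptions: each example carries two hypothetical feature vectors (one standing in for monolingual features, one for bilingual features), and a toy nearest-centroid model stands in for the real classifiers. All names below (`CentroidClassifier`, `co_train`, `per_round`) are illustrative and not taken from the paper.

```python
from typing import List, Tuple

Vector = List[float]
# Hypothetical two-view example: (monolingual features, bilingual features).
Example = Tuple[Vector, Vector]


def centroid(rows: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidClassifier:
    """Toy binary classifier (an assumption, not the paper's model)."""

    def fit(self, xs: List[Vector], ys: List[int]) -> "CentroidClassifier":
        # Requires at least one training example of each class.
        self.c0 = centroid([x for x, y in zip(xs, ys) if y == 0])
        self.c1 = centroid([x for x, y in zip(xs, ys) if y == 1])
        return self

    def score(self, x: Vector) -> float:
        """Positive favors label 1; the magnitude acts as a confidence."""
        return sq_dist(x, self.c0) - sq_dist(x, self.c1)

    def predict(self, x: Vector) -> int:
        return 1 if self.score(x) > 0 else 0


def co_train(seed: List[Example], seed_labels: List[int],
             unlabeled: List[Example], rounds: int = 3, per_round: int = 1):
    """Generic co-training: each view's classifier labels the unlabeled
    examples it is most confident about, and those examples join the
    shared training set used by both views in the next round."""
    labeled, labels, pool = list(seed), list(seed_labels), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Train one classifier per view on the current labeled set.
        clfs = [CentroidClassifier().fit([ex[v] for ex in labeled], labels)
                for v in (0, 1)]
        for v, clf in enumerate(clfs):
            # Most confident unlabeled examples for this view come first.
            pool.sort(key=lambda ex: abs(clf.score(ex[v])), reverse=True)
            for ex in pool[:per_round]:
                labeled.append(ex)
                labels.append(clf.predict(ex[v]))
            del pool[:per_round]
    # Final classifiers trained on the seed plus all self-labeled examples.
    mono = CentroidClassifier().fit([ex[0] for ex in labeled], labels)
    bi = CentroidClassifier().fit([ex[1] for ex in labeled], labels)
    return mono, bi, labeled, labels
```

Calling `co_train` with a small labeled seed and a larger unlabeled pool returns one classifier per view plus the grown training set. The idea being illustrated is that the two views can teach each other: an example that is easy for one view can become a training label for the other.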
Shane Bergsma, David Yarowsky, Kenneth Ward Church
Added 23 Aug 2011
Updated 23 Aug 2011
Type Journal
Year 2011
Where ACL