Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don’t do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data an...
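
The co-training step mentioned in the abstract can be illustrated with a minimal sketch. This is a generic co-training loop, not the authors' implementation: the `Example` fields, the `CentroidClassifier` stand-in, the label names `shared_head` and `separate`, and all feature values are hypothetical and chosen only to show how two classifiers over different feature views can label new examples for each other.

```python
from dataclasses import dataclass


@dataclass
class Example:
    mono: list  # hypothetical monolingual-view features (e.g. association scores)
    bi: list    # hypothetical bilingual-view features (e.g. translation-order cues)


class CentroidClassifier:
    """Tiny stand-in classifier: one mean vector per label."""

    def fit(self, vectors, labels):
        sums, counts = {}, {}
        for vec, lab in zip(vectors, labels):
            acc = sums.setdefault(lab, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[lab] = counts.get(lab, 0) + 1
        self.centroids = {
            lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()
        }
        return self

    def predict(self, vec):
        """Return (label, confidence); confidence is the squared-distance margin."""
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(vec, c)), lab)
            for lab, c in self.centroids.items()
        )
        margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
        return dists[0][1], margin


def train(labeled, view):
    """Fit a classifier on one feature view of the labeled examples."""
    return CentroidClassifier().fit(
        [getattr(e, view) for e, _ in labeled], [y for _, y in labeled]
    )


def co_train(seed, seed_labels, pool, rounds=3, per_round=1):
    """Each round, each view labels its most confident pool items for both views."""
    labeled = list(zip(seed, seed_labels))
    pool = list(pool)
    for _ in range(rounds):
        if not pool:
            break
        classifiers = {view: train(labeled, view) for view in ("mono", "bi")}
        for view, clf in classifiers.items():
            scored = sorted(
                ((clf.predict(getattr(e, view)), idx) for idx, e in enumerate(pool)),
                key=lambda item: item[0][1],
                reverse=True,
            )
            picked = scored[:per_round]
            for (label, _), idx in picked:
                labeled.append((pool[idx], label))
            chosen = {idx for _, idx in picked}
            pool = [e for i, e in enumerate(pool) if i not in chosen]
    return train(labeled, "mono"), train(labeled, "bi"), labeled


# Toy data: two labeled seed examples and four unlabeled ones (all values invented).
seed = [Example([1.0, 0.0], [0.9, 0.1]), Example([0.0, 1.0], [0.1, 0.9])]
seed_labels = ["shared_head", "separate"]
pool = [
    Example([0.9, 0.1], [0.5, 0.5]),  # clear in the monolingual view only
    Example([0.5, 0.5], [0.2, 0.8]),  # clear in the bilingual view only
    Example([0.2, 0.8], [0.3, 0.7]),
    Example([0.7, 0.3], [0.7, 0.3]),
]

mono_clf, bi_clf, labeled = co_train(seed, seed_labels, pool)
print(len(labeled), mono_clf.predict([0.95, 0.05])[0], bi_clf.predict([0.05, 0.95])[0])
```

In this toy setup the first pool item is ambiguous in the bilingual view but clear in the monolingual one, and the second is the reverse, so each classifier supplies a label the other could not have assigned confidently. A real system would use far larger pools and a stronger base classifier.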

Shane Bergsma, David Yarowsky, Kenneth Ward Church

ACL 2011 | Computational Linguistics | Noun Phrases | Penn Treebank | Size Matters 1 |

Added	23 Aug 2011
Updated	23 Aug 2011
Type	Journal
Year	2011
Where	ACL
Authors	Shane Bergsma, David Yarowsky, Kenneth Ward Church
