Scaling to Very Very Large Corpora for Natural Language Disambiguation

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
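The remark that correctly labeled training data is free for confusion set disambiguation can be made concrete with a small sketch. In well-edited text, every occurrence of a confusion-set word is already a labeled example: the word the writer chose is the label, and the surrounding words are the context. The code below is only an illustration under stated assumptions. The confusion set {then, than}, the window size, the `<TARGET>` placeholder, and the function name `extract_examples` are all hypothetical choices and are not taken from the paper.

```python
import re
from typing import List, Tuple

# Hypothetical confusion set, chosen only for illustration.
CONFUSION_SET = {"then", "than"}


def extract_examples(text: str, window: int = 2) -> List[Tuple[List[str], str]]:
    """Return (context_tokens, label) pairs for each confusion-set word in text.

    Assumption: the text is well edited, so the word actually written
    is treated as the correct label.
    """
    # Crude lowercase tokenizer; a real system would use a proper one.
    tokens = re.findall(r"[a-z']+", text.lower())
    examples = []
    for i, tok in enumerate(tokens):
        if tok in CONFUSION_SET:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            # Mask the target position so a learner sees only the context.
            examples.append((left + ["<TARGET>"] + right, tok))
    return examples


examples = extract_examples("She is taller than him. We ate, then we left.")
for context, label in examples:
    print(label, context)
```

Because no human annotation is involved, the number of examples grows with the amount of raw text processed, which is what makes training on very large corpora practical for this particular task.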

Michele Banko, Eric Brill

ACL 2001 | ACL 2007 | Labeled Data | Natural Language | Natural Language Tasks

Post Info

Added	31 Oct 2010
Updated	31 Oct 2010
Type	Conference
Year	2001
Where	ACL
Authors	Michele Banko, Eric Brill
