Previous work demonstrated that web counts can be used to approximate bigram frequencies, and thus should be useful for a wide variety of NLP tasks. So far, only two generation tasks (candidate selection for machine translation and confusion-set disambiguation) have been tested using web-scale data sets. The present paper investigates whether these results generalize to tasks covering both syntax and semantics, both generation and analysis, and a larger range of n-grams. For the majority of tasks, we find that simple, unsupervised models perform better when n-gram frequencies are obtained from the web rather than from a large corpus. However, in most cases, web-based models fail to outperform more sophisticated state-of-the-art models trained on small corpora. We argue that web-based models should therefore be used as a baseline for, rather than an alternative to, standard models.
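As an illustrative sketch of the kind of simple, unsupervised web-based model the abstract refers to, the snippet below ranks candidate phrases by exact-phrase web counts, as in candidate selection. The function web_hit_count and the values it returns are hypothetical placeholders standing in for a search-engine count API; they are not part of the paper and not real counts.

    # Sketch only: web_hit_count is a hypothetical stand-in for an
    # exact-phrase search-engine page count; the numbers are made up
    # for illustration, not measured data.

    def web_hit_count(phrase: str) -> int:
        """Hypothetical placeholder for a quoted-query web page count."""
        canned = {"strong tea": 250_000, "powerful tea": 12_000}
        return canned.get(phrase, 0)

    def select_candidate(candidates: list[str]) -> str:
        """Unsupervised choice: pick the candidate with the highest web count."""
        return max(candidates, key=web_hit_count)

    if __name__ == "__main__":
        # E.g. choosing between near-synonymous phrasings.
        print(select_candidate(["strong tea", "powerful tea"]))  # -> strong tea

The point of the sketch is only that the model reduces to comparing n-gram frequencies, which is why swapping corpus counts for web counts is straightforward.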