EACL
2006
ACL Anthology

Web Text Corpus for Natural Language Processing

Web text has been successfully used as training data for many NLP applications. While most previous work accesses web text through search engine hit counts, we built a Web Corpus by downloading web pages, producing a topic-diverse collection of 10 billion words of English. We show that for context-sensitive spelling correction, the Web Corpus yields better results than a search engine. For thesaurus extraction, the Web Corpus achieved overall results similar to those of a corpus of newspaper text. With many more words available on the web, better results can be obtained by collecting much larger web corpora.
Vinci Liu, James R. Curran
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2006
Where EACL
Authors Vinci Liu, James R. Curran