Sciweavers

LREC
2008

Subdomain Sensitive Statistical Parsing using Raw Corpora

14 years 1 months ago
Subdomain Sensitive Statistical Parsing using Raw Corpora
Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences addressing different subdomains (e.g. sports, politics, music), which implies that the statistics gathered by current statistical parsers are mixtures of subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques for creating an ensemble of domain sensitive parsers, and explore methods for amalgamating their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a tough, state-of-the-art baseline.
Barbara Plank, Khalil Sima'an
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where LREC
Authors Barbara Plank, Khalil Sima'an
Comments (0)