Subdomain Sensitive Statistical Parsing using Raw Corpora



Modern statistical parsers are trained on large annotated corpora (treebanks). These treebanks usually consist of sentences from different subdomains (e.g., sports, politics, music), so the statistics gathered by current statistical parsers are mixtures over subdomains of language use. In this paper we present a method that exploits raw subdomain corpora gathered from the web to introduce subdomain sensitivity into a given parser. We employ statistical techniques to create an ensemble of domain-sensitive parsers, and explore methods for combining their predictions. Our experiments show that introducing domain sensitivity by exploiting raw corpora can improve over a tough, state-of-the-art baseline.
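The abstract does not say how the ensemble's predictions are combined, so the following is only an illustrative sketch of one generic option: a weighted mixture over subdomain parsers. It is not the authors' method. The function name `combine_parses`, the input layout, and the weights are all hypothetical, and the log-probabilities are assumed to come from separate subdomain-sensitive parsers scoring the same candidate parses.

```python
import math


def combine_parses(candidates, domain_weights):
    """Pick the candidate parse with the highest weighted mixture probability.

    candidates: dict mapping a parse (string) to a dict of
        subdomain -> log-probability assigned by that subdomain's parser.
    domain_weights: dict mapping subdomain -> non-negative weight, e.g. an
        estimate of how strongly the sentence belongs to that subdomain.
    Both structures are assumptions made for this sketch.
    """
    total = sum(domain_weights.values())
    if total <= 0:
        raise ValueError("domain weights must sum to a positive value")

    best_parse, best_score = None, -math.inf
    for parse, logprobs in candidates.items():
        # Mixture score: sum over subdomains of normalized weight * probability.
        # Subdomains that did not score this parse contribute nothing.
        score = sum(
            (weight / total) * math.exp(logprobs[domain])
            for domain, weight in domain_weights.items()
            if domain in logprobs
        )
        if score > best_score:
            best_parse, best_score = parse, score
    return best_parse, best_score
```

With a sentence weighted mostly toward a sports subdomain, the parse preferred by the sports parser wins; shifting the weight to another subdomain can change the selected parse. Other combination schemes (for example, voting or a product of models) would fit the same interface.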

Barbara Plank, Khalil Sima'an


Current Statistical Parsers | Education | LREC 2008 | Modern Statistical Parsers | Statistical Parsers |


Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Barbara Plank, Khalil Sima'an
