Analysing Wikipedia and Gold-Standard Corpora for NER Training

Named entity recognition (NER) for English typically involves one of three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named entity annotated text. We present the first comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making the corpora more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold standard corpora on cross-corpus evaluation by up to 11%.

Joel Nothman, Tara Murphy, James R. Curran

Costly Manual Annotation | EACL 2009 | Entity Annotated Text | Gold Standard Corpora | Natural Language Processing |

Added	24 Nov 2009
Updated	24 Nov 2009
Type	Conference
Year	2009
Where	EACL
Authors	Joel Nothman, Tara Murphy, James R. Curran
