Using term informativeness for named entity detection

15 years 7 months ago


Informal communication (e-mail, bulletin boards) poses a difficult learning environment because traditional grammatical and lexical information is noisy. Other information is necessary for tasks such as named entity detection. How topic-centric, or informative, a word is can be valuable information. It is well known that informative words are best modeled by “heavy-tailed” distributions, such as mixture models. However, existing informativeness scores do not take full advantage of this fact. We introduce a new informativeness score that directly utilizes mixture model likelihood to identify informative words. We use the task of extracting restaurant names from bulletin board posts to measure effectiveness. We find that our “mixture score” is weakly effective alone and highly effective when combined with Inverse Document Frequency (IDF). We compare against other informativeness criteria and find that only Residual IDF is competitive with our combined IDF/Mixture score. Cat...
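The scores named in the abstract can be illustrated with a minimal Python sketch. IDF and Residual IDF follow their standard definitions (Residual IDF is observed IDF minus the IDF expected under a Poisson model). The `mixture_score` function is only one plausible reading of "directly utilizes mixture model likelihood": it is assumed here to be the log-likelihood gain of a two-component binomial mixture over a single binomial, fit with EM. The function names, the EM fitting procedure, the initialization, and the toy counts are all assumptions for illustration, not the authors' implementation.

```python
import math

def idf(df, n_docs):
    """Inverse document frequency: -log of the fraction of documents containing the word."""
    return -math.log(df / n_docs)

def residual_idf(df, cf, n_docs):
    """Observed IDF minus the IDF expected if the word's cf occurrences were Poisson-distributed."""
    expected_idf = -math.log(1.0 - math.exp(-cf / n_docs))
    return idf(df, n_docs) - expected_idf

def _log_binom(h, n, p):
    # Binomial log-likelihood without the binomial coefficient (it cancels in the score).
    return h * math.log(p) + (n - h) * math.log(1.0 - p)

def mixture_score(counts, lengths, iters=200):
    """Assumed form: log-likelihood of a 2-binomial mixture minus that of a single binomial.

    counts[i] is the word's count in document i, lengths[i] the document's length.
    """
    eps = 1e-9
    clip = lambda p: min(max(p, eps), 1.0 - eps)
    pairs = list(zip(counts, lengths))

    # Single-binomial baseline: one occurrence rate shared by all documents.
    theta = clip(sum(counts) / sum(lengths))
    ll_single = sum(_log_binom(h, n, theta) for h, n in pairs)

    # EM for the mixture; this initialization is an arbitrary illustrative choice.
    lam, p1, p2 = 0.5, clip(2.0 * theta), clip(0.5 * theta)

    def log_components(h, n):
        a = math.log(lam) + _log_binom(h, n, p1)
        b = math.log(1.0 - lam) + _log_binom(h, n, p2)
        return a, b

    for _ in range(iters):
        # E-step: responsibility of component 1 for each document.
        resp = []
        for h, n in pairs:
            a, b = log_components(h, n)
            m = max(a, b)
            ea, eb = math.exp(a - m), math.exp(b - m)
            resp.append(ea / (ea + eb))
        # M-step: re-estimate the mixing weight and the two occurrence rates.
        lam = clip(sum(resp) / len(pairs))
        p1 = clip(sum(r * h for r, (h, n) in zip(resp, pairs))
                  / max(sum(r * n for r, (h, n) in zip(resp, pairs)), eps))
        p2 = clip(sum((1 - r) * h for r, (h, n) in zip(resp, pairs))
                  / max(sum((1 - r) * n for r, (h, n) in zip(resp, pairs)), eps))

    # Mixture log-likelihood, computed with log-sum-exp for numerical stability.
    ll_mix = 0.0
    for h, n in pairs:
        a, b = log_components(h, n)
        m = max(a, b)
        ll_mix += m + math.log(math.exp(a - m) + math.exp(b - m))
    return ll_mix - ll_single

# Toy corpus: 10 documents of 100 tokens each (made-up numbers).
lengths = [100] * 10
bursty = [10, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # all occurrences in one document
uniform = [1] * 10                          # same total, spread evenly

bursty_mix = mixture_score(bursty, lengths)
uniform_mix = mixture_score(uniform, lengths)
bursty_ridf = residual_idf(df=1, cf=10, n_docs=10)
uniform_ridf = residual_idf(df=10, cf=10, n_docs=10)
```

On this toy data the bursty word, whose occurrences cluster in a single document, receives a higher mixture score and a higher Residual IDF than the evenly spread word, which matches the intuition that topic-centric words are poorly explained by a single occurrence rate.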

Jason D. M. Rennie, Tommi Jaakkola
