Improved Unsupervised Name Discrimination with Very Wide Bigrams and Automatic Cluster Stopping

We cast name discrimination as a problem in clustering short contexts. Each occurrence of an ambiguous name is treated independently, and represented using second-order context vectors. We calibrate our approach using a manually annotated collection of five ambiguous names from the Web, and then apply the learned parameter settings to three held-out sets of pseudo-name data that have been reported on in previous publications. We find that significant improvements in the accuracy of name discrimination can be achieved by using very wide bigrams, which are ordered pairs of words with up to 48 intervening words between them. We also show that recent developments in automatic cluster stopping can be used to predict the number of underlying identities without any significant loss of accuracy as compared to previous approaches which have set these values manually.
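
To make the notion of a "very wide bigram" concrete, the following is a minimal sketch of how such ordered pairs could be counted. It is an illustration only, not the paper's implementation: the function name `wide_bigrams`, the `max_gap` parameter, and the whitespace tokenization are assumptions made for this example, and the abstract does not say what filtering or scoring is applied to the pairs afterwards.

```python
from collections import Counter


def wide_bigrams(tokens, max_gap=48):
    """Count ordered pairs (left, right) where `right` follows `left`
    with at most `max_gap` intervening tokens.

    Hypothetical helper for illustration; max_gap=48 mirrors the
    "up to 48 intervening words" figure quoted in the abstract.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # The right word may sit up to max_gap + 1 positions after the left word.
        last = min(i + max_gap + 2, len(tokens))
        for j in range(i + 1, last):
            counts[(left, tokens[j])] += 1
    return counts


# Toy context (invented for this example), using a small gap for readability.
tokens = "george miller directed the film mad max".split()
pairs = wide_bigrams(tokens, max_gap=2)
```

With `max_gap=0` the function reduces to ordinary adjacent bigrams. Widening the gap lets a pair such as `("george", "film")` be counted even though several words separate the two, which is the kind of long-distance co-occurrence the abstract describes.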

Ted Pedersen

Ambiguous Names | CICLING 2009 | Natural Language Processing | Significant Improvements

Post Info

Added	24 Nov 2009
Updated	24 Nov 2009
Type	Conference
Year	2009
Where	CICLING
Authors	Ted Pedersen
