A Ground Truth Dataset for Matching Culturally Diverse Romanized Person Names



This paper describes the development of a ground truth dataset of culturally diverse Romanized names in which approximately 70,000 names are matched against a subset of 700. We ran the subset as queries against the complete list using several matchers, created adjudication pools, adjudicated the results, and compiled two versions of ground truth based on different sets of adjudication guidelines and methods for resolving adjudicator conflicts. The name list, drawn from publicly available sources, was manually seeded with more than 1,500 name variants. These variants include transliteration variation, database fielding errors, segmentation differences, incomplete names, titles, initials, abbreviations, nicknames, typos, OCR errors, and truncated data. These diverse types of matches, along with the coincidental name similarities already in the list, make possible a comprehensive evaluation of name matching systems. We have used the dataset to evaluate several open-source and commercial algorithm...
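
The pooling step mentioned in the abstract (running queries through several matchers and merging their candidates for adjudication) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two matchers are simple stand-ins rather than the systems used by the authors, `build_pool` and `top_k` are hypothetical names, and the toy names are not drawn from the dataset.

```python
from difflib import SequenceMatcher


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens (stand-in matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_pool(queries, names, matchers, top_k=2):
    """For each query, union the top_k candidates returned by every matcher."""
    pool = {}
    for query in queries:
        candidates = set()
        for matcher in matchers:
            ranked = sorted(names, key=lambda n: matcher(query, n), reverse=True)
            candidates.update(ranked[:top_k])
        pool[query] = candidates
    return pool


# Hypothetical toy data, not taken from the dataset described above.
names = [
    "Mohammed Al-Rashid",
    "Muhammad al Rashid",
    "Rashid, Mohammed",
    "John Smith",
    "Jon Smyth",
]
queries = ["Mohammed Al-Rashid"]

pool = build_pool(queries, names, [char_ratio, token_jaccard])
for query, candidates in pool.items():
    print(query, "->", sorted(candidates))
```

Each pooled candidate would then go to human adjudicators, who decide whether it is a true match for the query. Because different matchers surface different kinds of variants, the union is larger than any single matcher's list, which is the point of pooling.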

Mark Arehart, Keith J. Miller


Diverse Romanized Names | Education | Ground Truth | Ground Truth Dataset | LREC 2008


Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	LREC
Authors	Mark Arehart, Keith J. Miller
