

Name-ethnicity classification from open sources

The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with individual group accuracy comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.
Categories and Subject Descriptors I.2.1 [Applications and Expert Systems]: Cartography
General Terms Algorithms, Experimentation
Keywords ethnicity detection, name classification, news analysis, social science research
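The abstract names hidden Markov models as one component of the classifier but gives no implementation details. As a rough illustration of the general idea of scoring a name under one sequence model per group, the sketch below uses a much simpler stand-in: a character-bigram Markov chain with add-one smoothing, not an HMM and not the authors' method. The class name `CharBigramModel`, the function `classify`, the labels `group_a` and `group_b`, and the tiny name lists are all hypothetical and chosen only for the example; they are not data or groups from the paper.

```python
import math
from collections import defaultdict


class CharBigramModel:
    """Character-bigram Markov chain with add-one smoothing.

    Assumption: a simplified stand-in for the HMMs mentioned in the
    abstract, used only to illustrate per-group sequence scoring.
    """

    def __init__(self, vocab_size=27):
        # vocab_size is an assumed smoothing constant:
        # 26 lowercase letters plus the end-of-name marker "$".
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)
        self.vocab_size = vocab_size

    def train(self, names):
        """Count character transitions, with "^" and "$" as boundaries."""
        for name in names:
            chars = "^" + name.lower() + "$"
            for prev, cur in zip(chars, chars[1:]):
                self.counts[prev][cur] += 1
                self.totals[prev] += 1

    def log_likelihood(self, name):
        """Return the smoothed log-probability of the name under this model."""
        chars = "^" + name.lower() + "$"
        score = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.counts[prev][cur] + 1
            denominator = self.totals[prev] + self.vocab_size
            score += math.log(numerator / denominator)
        return score


def classify(name, models):
    """Pick the label whose model assigns the name the highest log-likelihood."""
    return max(models, key=lambda label: models[label].log_likelihood(name))


# Hypothetical toy lists for illustration only; not the paper's training data.
TOY_TRAINING = {
    "group_a": ["garcia", "martinez", "rodriguez", "hernandez", "lopez"],
    "group_b": ["nakamura", "tanaka", "yamamoto", "suzuki", "watanabe"],
}

models = {}
for label, names in TOY_TRAINING.items():
    model = CharBigramModel()
    model.train(names)
    models[label] = model

if __name__ == "__main__":
    print(classify("garcia", models))
```

A real system along the lines the abstract describes would need far larger and noisier training lists, and would combine such sequence scores with other features (the abstract mentions decision trees) rather than relying on a single argmax.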
Added 25 Nov 2009
Updated 25 Nov 2009
Type Conference
Year 2009
Where KDD
Authors Anurag Ambekar, Charles B. Ward, Jahangir Mohammed, Steven Skiena, Swapna Male