Because of name variations, an author may have multiple names and multiple authors may share the same name. Such name ambiguity affects the performance of document retrieval, web search, database integration, and may cause improper attribution to authors. This paper presents a hierarchical naive Bayes mixture model, an unsupervised learning approach, for name disambiguation in author citations. This method partitions a collection of citations 1 into clusters, with each cluster containing only citations authored by the same author, thus disambiguating authorship in citations to induce author name identities. Three types of citation features are used: co-author names, paper title words, and journal or proceeding title words. The approach is illustrated with 16 name datasets that are constructed based on the publication lists collected from author homepages and DBLP computer science bibliography. Categories and Subject Descriptors H.3.3 [Information Systems]: Information Search and Retri...
Hui Han, Wei Xu, Hongyuan Zha, C. Lee Giles