clustering of documents according to sharing of topics at multiple levels of abstraction. Given a corpus of documents, a posterior inference algorithm finds an approximation to a posterior distribution over trees, topics and allocations of words to levels of the tree. We demonstrate this m on collections of scientific abstracts from several journals. This model exemplifies a recent trend in statistical machine learning—the use of Bayesian nonparametric methods to infer distributions on flexible data structures. Categories and Subject Descriptors: G.3 [PROBABILITY AND STATISTICS]: Stochastic processes; I.2.7 [ARTIFICIAL INTELLIGENCE]: Text analysis General Terms: Algorithms, Experimentation Additional Key Words and Phrases: Bayesian nonparametric statistics, Unsupervised learning (To appear in the Journal of the ACM)
David M. Blei, Thomas L. Griffiths, Michael I. Jor