Sciweavers

SDM
2007
SIAM

Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning

14 years 26 days ago
Topic Models over Text Streams: A Study of Batch and Online Unsupervised Learning
Topic modeling techniques have widespread use in text data mining applications. Some applications use batch models, which perform clustering on the document collection in aggregate. In this paper, we analyze and compare the performance of three recently-proposed batch topic models—Latent Dirichlet Allocation (LDA), Dirichlet Compound Multinomial (DCM) mixtures and von-Mises Fisher (vMF) mixture models. In cases where offline clustering on complete document collections is infeasible due to resource and response-rate constraints, online unsupervised clustering methods that process incoming data incrementally are necessary. To this end, we propose online variants of vMF, EDCM and LDA. Experiments on large real-world document collections, in both the offline and online settings, demonstrate that though LDA is a good model for finding word-level topics, vMF finds better document-level topic clusters more efficiently, which is often important in text mining applications. Finally, we pro...
Arindam Banerjee, Sugato Basu
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2007
Where SDM
Authors Arindam Banerjee, Sugato Basu
Comments (0)