Sciweavers

ICML
2006
IEEE

Topic modeling: beyond bag-of-words

14 years 8 months ago
Topic modeling: beyond bag-of-words
Some models of textual corpora employ text generation methods involving n-gram statistics, while others use latent topic variables inferred using the "bag-of-words" assumption, in which word order is ignored. Previously, these methods have not been combined. In this work, I explore a hierarchical generative probabilistic model that incorporates both n-gram statistics and latent topic variables by extending a unigram topic model to include properties of a hierarchical Dirichlet bigram language model. The model hyperparameters are inferred using a Gibbs EM algorithm. On two data sets, each of 150 documents, the new model exhibits better predictive accuracy than either a hierarchical Dirichlet bigram language model or a unigram topic model. Additionally, the inferred topics are less dominated by function words than are topics discovered using unigram statistics, potentially making them more meaningful.
Hanna M. Wallach
Added 17 Nov 2009
Updated 17 Nov 2009
Type Conference
Year 2006
Where ICML
Authors Hanna M. Wallach
Comments (0)