Statistical topic models such as Latent Dirichlet Allocation (LDA) have emerged as an attractive framework to model, visualize, and summarize large document collections in a completely unsupervised fashion. One limitation of this family of models is the assumption of exchangeability of words within documents, which results in a 'bag-of-words' representation for documents as well as topics. As a consequence, valuable information that exists in the form of correlations between words is lost in these models. In this work, we adapt recent advances in sparse modeling techniques to the problem of modeling word correlations within topics and present a new algorithm called Sparse Word Graphs. Our experiments on the AP corpus reveal both long-distance and short-distance word correlations within topics that are semantically very meaningful. In addition, the new algorithm is highly scalable to large collections, as it captures only the most important correlations in a sparse manner.
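To make the idea of keeping only the most important word correlations concrete, here is a minimal toy sketch (not the authors' Sparse Word Graphs algorithm): it counts within-document co-occurrences for words assumed to belong to one topic, then thresholds the counts to retain a sparse set of strong edges. The documents, vocabulary size, and percentile cutoff are all hypothetical choices for illustration.

```python
import numpy as np

# Hypothetical toy data: documents as lists of word indices,
# assumed (for illustration) to be assigned to a single topic.
docs = [
    [0, 1, 2, 1],
    [0, 2, 3],
    [1, 2, 3, 3],
]
vocab_size = 4

# Count within-document co-occurrences of distinct word types.
C = np.zeros((vocab_size, vocab_size))
for doc in docs:
    for i in doc:
        for j in doc:
            if i != j:
                C[i, j] += 1

# Sparsity by thresholding: keep only the strongest correlations.
# (The paper uses sparse modeling techniques; this cutoff is a
# stand-in to show the sparse-graph output format.)
threshold = np.percentile(C[C > 0], 75)
edges = [(i, j, C[i, j]) for i in range(vocab_size)
         for j in range(i + 1, vocab_size) if C[i, j] >= threshold]
print(edges)
```

The resulting `edges` list is a sparse word graph over the vocabulary: each entry is a pair of word indices plus an association weight, and only edges above the cutoff survive, which is what keeps the representation scalable to large collections.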
Ramesh Nallapati, Amr Ahmed, William W. Cohen, Eri