Sciweavers

CIKM
2008
Springer

Modeling hidden topics on document manifold

Topic modeling is a key problem in document analysis. One canonical approach is Probabilistic Latent Semantic Indexing (PLSI), which maximizes the joint probability of documents and terms in the corpus. The major disadvantage of PLSI is that it estimates each document's probability distribution over the hidden topics independently, so the number of model parameters grows linearly with the size of the corpus, which leads to serious overfitting. Latent Dirichlet Allocation (LDA) was proposed to overcome this problem by treating each document's distribution over topics as a hidden random variable. Both methods discover the hidden topics in Euclidean space. However, there is no convincing evidence that the document space is Euclidean, or flat. It is therefore more natural and reasonable to assume that the document space is a manifold, either linear or nonlinear. In this paper, we consider the problem of ...
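To make the PLSI baseline described in the abstract concrete, here is a minimal sketch of PLSI fitted by EM on a small document-term count matrix. It illustrates only plain PLSI, not the manifold-based method of the paper, whose description is truncated above. The function names (`normalize`, `plsi_em`, `log_likelihood`), the random initialization, and the iteration count are assumptions made for illustration. Because P(d) is fixed by the observed counts, maximizing the joint probability of documents and terms is treated here as maximizing the sum of n(d,w) log P(w|d).

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers to sum to 1 (uniform if all zero)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSI by EM. counts[d][w] is the count of term w in document d."""
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])
    # P(z|d): one topic distribution per document, so this table has
    # n_docs * n_topics entries and grows linearly with the corpus.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    # P(w|z): one term distribution per topic.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_terms for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_terms):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_d[d][z] += n * post[z]
                    new_w_z[z][w] += n * post[z]
        # M-step: renormalize the expected counts.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Sum over observed (d, w) pairs of n(d,w) * log sum_z P(z|d) P(w|z)."""
    ll = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                ll += n * math.log(p)
    return ll
```

Calling `plsi_em(counts, n_topics=2)` on a small matrix returns one topic distribution per document. That per-document table is what makes the parameter count scale with corpus size, which is the overfitting concern the abstract raises.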
Deng Cai, Qiaozhu Mei, Jiawei Han, Chengxiang Zhai
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where CIKM
Authors Deng Cai, Qiaozhu Mei, Jiawei Han, Chengxiang Zhai