Decomposing background topics from keywords by principal component pursuit


Download perception.csl.illinois.edu

Low-dimensional topic models have proven very useful for modeling a large corpus of documents that share a relatively small number of topics. Dimensionality reduction tools such as Principal Component Analysis (PCA) or Latent Semantic Indexing (LSI) have been widely adopted for document modeling, analysis, and retrieval. In this paper, we contend that a more pertinent model for a document corpus is the combination of an (approximately) low-dimensional topic model for the corpus and a sparse model for the keywords of individual documents. For such a joint topic-document model, LSI or PCA is no longer appropriate to analyze the corpus data. We hence introduce a powerful new tool called Principal Component Pursuit that can effectively decompose the low-dimensional and the sparse components of such corpus data. We give empirical results on data synthesized with a Latent Dirichlet Allocation (LDA) model to validate the new model. We then show that for real document data analysis, the new too...
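To make the low-rank plus sparse decomposition in the abstract concrete, the following is a minimal sketch of Principal Component Pursuit in its standard convex form: minimize ||L||_* + lambda * ||S||_1 subject to L + S = M. It assumes M is a term-document matrix, uses the commonly cited default lambda = 1/sqrt(max(m, n)), and solves the problem with a simple augmented Lagrangian iteration. The function names (`pcp`, `soft_threshold`, `singular_value_threshold`) and the solver settings are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def soft_threshold(X, tau):
    """Entrywise shrinkage: the proximal operator of tau * ||.||_1."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)


def singular_value_threshold(X, tau):
    """Shrink singular values by tau: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt


def pcp(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split M into a low-rank part L and a sparse part S with L + S ~= M.

    Hypothetical helper for illustration. Assumes M is a nonzero 2-D array,
    e.g. a term-document matrix (background topics in L, keywords in S).
    """
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))        # common default weight (assumption)
    if mu is None:
        mu = m * n / (4.0 * np.abs(M).sum())  # common penalty heuristic (assumption)

    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multiplier for L + S = M
    norm_M = np.linalg.norm(M)

    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_M:
            break
    return L, S
```

Calling `L, S = pcp(M)` on a matrix with documents as columns would, under these assumptions, place the shared topic structure in `L` and document-specific keyword weights in the nonzero entries of `S`. How well that split matches real corpora depends on the data and on the choice of `lam`.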

Kerui Min, Zhengdong Zhang, John Wright, Yi Ma


CIKM 2010 | Document | Information Technology | Latent Semantic Indexing | Principal Component |


Added	24 Jan 2011
Updated	24 Jan 2011
Type	Journal
Year	2010
Where	CIKM
Authors	Kerui Min, Zhengdong Zhang, John Wright, Yi Ma
