

A Similarity-based Probability Model for Latent Semantic Indexing

A dual probability model is constructed for Latent Semantic Indexing (LSI) using the cosine similarity measure. Both the document-document similarity matrix and the term-term similarity matrix naturally arise from the maximum likelihood estimation of the model parameters, and the optimal solutions are the latent semantic vectors of LSI. Dimensionality reduction is justified by the statistical significance of latent semantic vectors as measured by the likelihood of the model. This leads to a statistical criterion for the optimal semantic dimensions, answering a critical open question in LSI with practical importance. Thus the model establishes a statistical framework for LSI. Ambiguities related to statistical modeling of LSI are clarified.
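The link between the two similarity matrices and the latent semantic vectors can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the toy term-document matrix, the helper name `lsi_vectors`, and the choice of k = 2 are assumptions made for illustration, and the sketch only checks the standard linear-algebra fact that the leading eigenvectors of the two similarity matrices coincide with the singular vectors used by LSI. It does not implement the probability model or the likelihood-based criterion for choosing the dimension.

```python
import numpy as np

def lsi_vectors(X, k):
    """Return the top-k left singular vectors, singular values, and
    right singular vectors of X (hypothetical helper, not from the paper)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 1.0, 1.0, 2.0],
    [1.0, 0.0, 0.0, 0.0],
])

# Scale each document (column) to unit length so that dot products
# between documents are cosine similarities.
X = X / np.linalg.norm(X, axis=0, keepdims=True)

doc_sim = X.T @ X    # document-document cosine similarity matrix
term_sim = X @ X.T   # term-term matrix built from the same normalized data

k = 2
U, s, V = lsi_vectors(X, k)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reverse to get the largest first.
doc_vals, doc_vecs = np.linalg.eigh(doc_sim)
term_vals, term_vecs = np.linalg.eigh(term_sim)
top_doc_vals, top_doc_vecs = doc_vals[::-1][:k], doc_vecs[:, ::-1][:, :k]
top_term_vals, top_term_vecs = term_vals[::-1][:k], term_vecs[:, ::-1][:, :k]

print("squared singular values:", s ** 2)
print("top eigenvalues of doc_sim:", top_doc_vals)
print("top eigenvalues of term_sim:", top_term_vals)
```

Both similarity matrices share the same nonzero eigenvalues (the squared singular values), and their leading eigenvectors match the right and left singular vectors up to sign, which is why a single truncated SVD yields both the document and the term representations.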
Chris H. Q. Ding
Added 03 Aug 2010
Updated 03 Aug 2010
Type Conference
Year 1999
Authors Chris H. Q. Ding