Sciweavers



ICML
2005
IEEE


Modeling word burstiness using the Dirichlet distribution


Download cseweb.ucsd.edu

Multinomial distributions are often used to model text documents. However, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. In this paper, we propose the Dirichlet compound multinomial model (DCM) as an alternative to the multinomial. The DCM model has one additional degree of freedom, which allows it to capture burstiness. We show experimentally that the DCM is substantially better than the multinomial at modeling text data, measured by perplexity. We also show using three standard document collections that the DCM leads to better classification than the multinomial model. DCM performance is comparable to that obtained with multiple heuristic changes to the multinomial model.
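As a rough illustration of the burstiness claim, the sketch below compares the log-probability of a word-count vector under a multinomial and under a DCM. It assumes the standard Dirichlet compound multinomial (Polya) form, P(x | alpha) = n!/prod(x_w!) * Gamma(A)/Gamma(n + A) * prod(Gamma(x_w + alpha_w)/Gamma(alpha_w)) with A = sum(alpha_w); the formula is not quoted on this page, and the function names and toy parameter values are hypothetical.

```python
from math import lgamma, log


def multinomial_log_prob(counts, probs):
    """Log-probability of a count vector under a multinomial with word probabilities `probs`."""
    n = sum(counts)
    out = lgamma(n + 1)  # log n!
    for x, p in zip(counts, probs):
        out -= lgamma(x + 1)  # minus log x_w!
        if x > 0:
            out += x * log(p)
    return out


def dcm_log_prob(counts, alphas):
    """Log-probability of a count vector under a DCM with parameters `alphas` (all > 0)."""
    n = sum(counts)
    a = sum(alphas)  # A = sum of alpha_w; a small A makes repeated words more likely
    out = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alphas):
        out += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return out


# Toy vocabulary of four words; the numbers are illustrative assumptions only.
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # four different words, once each
uniform = [0.25] * 4    # multinomial word probabilities
alphas = [0.1] * 4      # DCM parameters with the same mean but small A

if __name__ == "__main__":
    print("multinomial:", multinomial_log_prob(bursty, uniform), multinomial_log_prob(spread, uniform))
    print("DCM:        ", dcm_log_prob(bursty, alphas), dcm_log_prob(spread, alphas))
```

With these toy settings the multinomial prefers the spread-out document, while the DCM assigns the higher probability to the bursty one, even though both models share the same expected word proportions.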

Rasmus Elsborg Madsen, David Kauchak, Charles Elkan


Compound Multinomial Model | DCM Model | ICML 2005 | Machine Learning | Multinomial Distributions


Related Content

» Clustering documents with an exponential-family approximation of the Dirichlet compound mul...

» Topic models with power-law using Pitman-Yor process

» Incorporating domain knowledge into topic modeling via Dirichlet Forest priors

» Unsupervised determination of efficient Korean LVCSR units using a Bayesian Dirichlet proc...

» Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Ge...

» Distributed Inference for Latent Dirichlet Allocation

» Deriving TF-IDF as a Fisher Kernel

» Hierarchical Pitman-Yor language model for information retrieval

» A Bayesian Review of the Poisson-Dirichlet Process

Post Info

Added	17 Nov 2009
Updated	17 Nov 2009
Type	Conference
Year	2005
Where	ICML
Authors	Rasmus Elsborg Madsen, David Kauchak, Charles Elkan
