A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

In this paper we propose a probabilistic model for online document clustering. We use a non-parametric Dirichlet process prior to model the growing number of clusters, and use a prior built from a general English language model as the base distribution to handle the generation of novel clusters. Furthermore, cluster uncertainty is modeled with a Bayesian Dirichlet-multinomial distribution. We use an empirical Bayes method to estimate hyperparameters from a historical dataset. Our probabilistic model is applied to the novelty detection task in Topic Detection and Tracking (TDT) and compared with existing approaches in the literature.
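
The ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' inference procedure. It assumes a Chinese-restaurant-process form of the Dirichlet process prior, a Dirichlet-multinomial likelihood whose pseudo-counts come from a base word distribution, and a greedy assignment of each document to its most probable cluster. The class name `OnlineDPClusterer`, the parameters `concentration` and `base_strength`, and the toy vocabulary are hypothetical; the hyperparameters are set by hand here rather than estimated with empirical Bayes.

```python
import math
from collections import Counter


def log_dirichlet_multinomial(doc_counts, pseudo_counts):
    """Log marginal likelihood of word counts under a Dirichlet-multinomial.

    The multinomial coefficient is omitted because it is identical for
    every cluster and cancels when the scores are normalized.
    """
    alpha_sum = sum(pseudo_counts.values())
    n = sum(doc_counts.values())
    result = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + n)
    for word, count in doc_counts.items():
        a = pseudo_counts[word]
        result += math.lgamma(a + count) - math.lgamma(a)
    return result


class OnlineDPClusterer:
    """Illustrative online clusterer (hypothetical, hand-set hyperparameters)."""

    def __init__(self, base_probs, concentration=1.0, base_strength=10.0):
        # Base distribution: a general language model scaled into pseudo-counts.
        self.base = {w: base_strength * p for w, p in base_probs.items()}
        self.concentration = concentration
        self.clusters = []  # each entry is [number of documents, word Counter]

    def add(self, words):
        """Assign one document; return (cluster index, novelty score)."""
        doc = Counter(w for w in words if w in self.base)
        scores = []
        for size, counts in self.clusters:
            # Existing cluster: prior weight proportional to its size.
            pseudo = {w: a + counts[w] for w, a in self.base.items()}
            scores.append(math.log(size) + log_dirichlet_multinomial(doc, pseudo))
        # New cluster: prior weight proportional to the concentration parameter.
        scores.append(math.log(self.concentration)
                      + log_dirichlet_multinomial(doc, self.base))
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        posterior = [w / total for w in weights]
        novelty = posterior[-1]  # posterior probability of a new cluster
        best = max(range(len(posterior)), key=posterior.__getitem__)
        if best == len(self.clusters):
            self.clusters.append([1, doc.copy()])
        else:
            self.clusters[best][0] += 1
            self.clusters[best][1].update(doc)
        return best, novelty
```

Calling `add` on each incoming document in order returns the chosen cluster index together with the posterior probability that the document starts a new cluster; in this sketch that probability serves as the novelty score, and the first document always receives a score of 1.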

Jian Zhang 0003, Zoubin Ghahramani, Yiming Yang

English Language Model | NIPS 2004 | NIPS 2007 | Non-parametric Dirichlet Process | Probabilistic Model

» Model Based Population Tracking and Automatic Detection of Distribution Changes

» A Topic Model for Linked Documents and Update Rules for its Estimation

» A Bayesian Exemplar-Based Approach to Hierarchical Shape Matching

» Clustering Text Data Streams

Post Info

Added	31 Oct 2010
Updated	31 Oct 2010
Type	Conference
Year	2004
Where	NIPS
Authors	Jian Zhang 0003, Zoubin Ghahramani, Yiming Yang
