EMNLP
2008

One-Class Clustering in the Text Domain

Having seen a news title "Alba denies wedding reports", how do we infer that it is primarily about Jessica Alba, rather than about weddings or reports? We probably realize that, in a randomly drawn sentence, the word "Alba" is less anticipated than "wedding" or "reports", which adds value to the word "Alba" if used. Such anticipation can be modeled as a ratio between an empirical probability of the word (in a given corpus) and its estimated probability in general English. Aggregated over all words in a document, this ratio may be used as a measure of the document's topicality. Assuming that the corpus consists of on-topic and off-topic documents (we call them the core and the noise), our goal is to determine which documents belong to the core. We propose two unsupervised methods for doing this. First, we assume that words are sampled i.i.d., and propose an information-theoretic framework for determining the core. Second, we relax...
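The aggregated probability ratio described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual estimator: the function name `topicality`, the add-one smoothing, and the fallback probability for out-of-vocabulary words are all assumptions made for the sketch, and the `general_prob` table stands in for whatever general-English language model is used.

```python
import math
from collections import Counter

def topicality(doc_tokens, corpus_counts, corpus_total, general_prob, alpha=1.0):
    """Score a document by summing, over its tokens, the log-ratio between
    the word's (smoothed) empirical probability in the given corpus and its
    estimated probability in general English. Higher scores suggest the
    document leans on corpus-specific (on-topic) vocabulary."""
    vocab_size = len(corpus_counts)
    score = 0.0
    for w in doc_tokens:
        # Add-one-style smoothed empirical probability of w in the corpus
        # (smoothing choice is an assumption of this sketch).
        p_corpus = (corpus_counts[w] + alpha) / (corpus_total + alpha * vocab_size)
        # Probability of w in general English; the floor for unseen words
        # is an arbitrary assumption here.
        p_general = general_prob.get(w, 1e-8)
        score += math.log(p_corpus / p_general)
    return score

# Toy illustration: "alba" is frequent in the corpus but rare in general
# English, so a document using it scores as more topical than one using
# the common words "wedding" or "reports".
corpus_counts = Counter({"alba": 10, "wedding": 5, "reports": 5})
corpus_total = sum(corpus_counts.values())
general_prob = {"alba": 1e-6, "wedding": 1e-4, "reports": 1e-4}
```

Under these toy numbers, `topicality(["alba"], ...)` exceeds `topicality(["wedding"], ...)`, matching the intuition that the unanticipated word carries more topical weight.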
Ron Bekkerman, Koby Crammer
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where EMNLP
Authors Ron Bekkerman, Koby Crammer