Automatically detecting the first document that reports each new event in a temporally sequenced document stream remains an open challenge. In this paper we propose a new approach that addresses the problem in two stages: 1) using a supervised learning algorithm to classify the on-line document stream into pre-defined broad topic categories, and 2) performing topic-conditioned novelty detection for the documents in each topic. We also exploit named entities for event-level novelty detection and use feature-based heuristics derived from the topic histories. Evaluated on a set of broadcast news stories, these methods show substantial performance gains over the traditional one-level approach to novelty detection.

Categories and Subject Descriptors
I.5.2 [Design Methodology]: Classifier design and evaluation; Feature evaluation and selection; Pattern analysis; H.3.3 [Information Search and Retrieval]: Information filtering

General Terms
Design, Experimen...
Yiming Yang, Jian Zhang, Jaime G. Carbonell, Chun
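
The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword-based `classify_topic` is a hypothetical stand-in for the supervised classifier, and the term-frequency vectors, cosine similarity, and `threshold` value are assumptions chosen for illustration. Named-entity features and the topic-history heuristics mentioned in the abstract are omitted.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Term-frequency vector over lowercased whitespace tokens (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical keyword lists standing in for a trained topic classifier (stage 1).
TOPIC_KEYWORDS = {
    "disaster": {"earthquake", "flood", "hurricane"},
    "politics": {"election", "senate", "vote"},
}


def classify_topic(text):
    """Assign the topic with the most keyword hits, or 'other' if none match."""
    tokens = set(text.lower().split())
    best = max(TOPIC_KEYWORDS, key=lambda t: len(tokens & TOPIC_KEYWORDS[t]))
    return best if tokens & TOPIC_KEYWORDS[best] else "other"


def detect_first_stories(stream, threshold=0.5):
    """Stage 2: compare each document only with earlier documents of its own topic.

    A document is flagged as novel when its highest similarity to the
    topic history falls below `threshold` (an assumed value).
    """
    history = defaultdict(list)  # topic -> vectors of documents seen so far
    flags = []
    for text in stream:
        topic = classify_topic(text)
        vec = tf_vector(text)
        max_sim = max((cosine(vec, past) for past in history[topic]), default=0.0)
        flags.append((topic, max_sim < threshold))
        history[topic].append(vec)
    return flags


stream = [
    "earthquake strikes coastal city",
    "earthquake strikes coastal city again",
    "senate vote on budget delayed",
    "hurricane floods inland towns",
]
result = detect_first_stories(stream)
```

A one-level detector would compare every incoming document with the entire history. Conditioning on the topic restricts the comparison to documents from the same broad category, which is where the abstract locates the performance gains.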