An investigation of linguistic features and clustering algorithms for topical document clustering

We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA’s Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm
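As a rough illustration of how the agglomerative linkage criteria named in the abstract differ, the sketch below scores a pair of clusters from pairwise document similarities. The function name `cluster_similarity`, the toy similarity matrix, and the choice to work with similarities rather than distances are assumptions made for illustration and are not taken from the paper. The "average" case here is the plain between-cluster average, which may not match the paper's exact groupwise-average definition.

```python
from itertools import product

def cluster_similarity(sim, a, b, method):
    """Similarity between clusters a and b (lists of document indices).

    sim is a symmetric matrix of pairwise document similarities
    (hypothetical values; higher means more alike).
    """
    pair_sims = [sim[i][j] for i, j in product(a, b)]
    if method == "single":
        # Single-link: the clusters are as similar as their closest pair.
        return max(pair_sims)
    if method == "complete":
        # Complete-link: the clusters are as similar as their farthest pair.
        return min(pair_sims)
    if method == "average":
        # Average over all cross-cluster pairs.
        return sum(pair_sims) / len(pair_sims)
    raise ValueError(f"unknown method: {method}")

# Toy 4-document similarity matrix (illustrative values only).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.6, 0.3],
    [0.2, 0.6, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
left, right = [0, 1], [2, 3]

for method in ("single", "complete", "average"):
    print(method, cluster_similarity(sim, left, right, method))
```

An agglomerative clusterer would repeatedly merge the pair of clusters with the highest score under the chosen criterion, so the same data can yield different hierarchies depending on which criterion is used.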

Vasileios Hatzivassiloglou, Luis Gravano, Ankineedu Maganti

DARPA’s Topic Detection | Document Clustering | Hierarchical Clustering Methods | Information Management | SIGIR 2000 |

Post Info

Added	01 Aug 2010
Updated	01 Aug 2010
Type	Conference
Year	2000
Where	SIGIR
Authors	Vasileios Hatzivassiloglou, Luis Gravano, Ankineedu Maganti
