Efficient clustering of high-dimensional data sets with application to reference matching

Many important problems involve clustering large datasets. Although naive implementations of clustering are computationally expensive, there are established efficient techniques for clustering when the dataset has either (1) a limited number of clusters, (2) a low feature dimensionality, or (3) a small number of data points. However, there has been much less work on methods of efficiently clustering datasets that are large in all three ways at once, for example, datasets with millions of data points that exist in many thousands of dimensions and represent many thousands of clusters. We present a new technique for clustering these large, high-dimensional datasets. The key idea is to use a cheap, approximate distance measure to efficiently divide the data into overlapping subsets we call canopies. Clustering is then performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practica...
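
The canopy idea described in the abstract can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `build_canopies` and `candidate_pairs`, the two thresholds `t_loose` and `t_tight`, and the random choice of canopy centers are all choices made here for the example; the abstract itself only specifies a cheap approximate distance and overlapping subsets.

```python
import random
from itertools import combinations


def build_canopies(points, cheap_dist, t_loose, t_tight, seed=0):
    """Group point indices into overlapping canopies using a cheap distance.

    Assumption (not stated in the abstract): a loose threshold decides
    canopy membership, and a tighter one decides which points may no
    longer start a canopy of their own.
    """
    if t_tight > t_loose:
        raise ValueError("t_tight must not exceed t_loose")
    rng = random.Random(seed)
    remaining = set(range(len(points)))  # indices still eligible as centers
    canopies = []
    while remaining:
        center = rng.choice(sorted(remaining))
        dists = [cheap_dist(points[center], p) for p in points]
        # Membership is checked against all points, so canopies may overlap.
        canopy = [i for i, d in enumerate(dists) if d <= t_loose]
        canopies.append(canopy)
        # Points very close to the center cannot become centers later.
        remaining -= {i for i, d in enumerate(dists) if d <= t_tight}
        remaining.discard(center)  # guarantees progress on every iteration
    return canopies


def candidate_pairs(canopies):
    """Return the index pairs that share at least one canopy.

    Only these pairs would need the expensive, exact distance.
    """
    pairs = set()
    for canopy in canopies:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs
```

For instance, with the one-dimensional points `[0.0, 0.1, 0.2, 5.0, 5.1]`, absolute difference as the cheap distance, `t_loose=1.0`, and `t_tight=0.5`, the sketch produces two canopies, and only 4 of the 10 possible pairs remain as candidates for exact distance computation. In practice the cheap distance would be something far less costly than the exact measure, which is where the savings come from.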

Andrew McCallum, Kamal Nigam, Lyle H. Ungar

Data Mining | Greedy Agglomerative Clustering | KDD 2000 | Large Clustering Problems | Traditional Clustering Approach |

Post Info

Added	25 Aug 2010
Updated	25 Aug 2010
Type	Conference
Year	2000
Where	KDD
Authors	Andrew McCallum, Kamal Nigam, Lyle H. Ungar
