Sciweavers

KDD
2001
ACM

GESS: a scalable similarity-join algorithm for mining large data sets in high dimensional spaces

The similarity join is an important operation for mining high-dimensional feature spaces. Given two data sets, the similarity join computes all tuples (x, y) that are within a distance ε of each other. One of the most efficient algorithms for processing similarity joins is the Multidimensional-Spatial Join (MSJ) by Koudas and Sevcik. In our previous work, pursued for the two-dimensional case, we found, however, that MSJ has several performance shortcomings in terms of CPU and I/O cost as well as memory requirements. Therefore, MSJ is not generally applicable to high-dimensional data. In this paper, we propose a new algorithm named Generic External Space Sweep (GESS). GESS introduces a modest rate of data replication to reduce the number of expensive distance computations. We present a new cost model for replication, an I/O model, and an inexpensive method for duplicate removal. The principal component of our algorithm is a highly flexible replication engine. Our analytical model predicts a tremen...
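
For orientation, the join defined in the abstract can be written as a brute-force nested loop. This is only a baseline sketch of the operation itself, not GESS or MSJ; the function name `naive_similarity_join`, the parameter `eps` (the distance threshold), the choice of Euclidean distance, and the inclusive comparison are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def naive_similarity_join(
    r: List[Point], s: List[Point], eps: float
) -> List[Tuple[Tuple[float, ...], Tuple[float, ...]]]:
    """Return every pair (x, y), x from r and y from s, with distance <= eps.

    Assumptions (not taken from the abstract): Euclidean distance and an
    inclusive threshold. Every pair is compared, so the cost is
    len(r) * len(s) distance computations.
    """
    result = []
    for x in r:
        for y in s:
            # One distance computation per candidate pair.
            if math.dist(x, y) <= eps:
                result.append((tuple(x), tuple(y)))
    return result


if __name__ == "__main__":
    r = [(0.0, 0.0), (1.0, 1.0)]
    s = [(0.1, 0.0), (5.0, 5.0)]
    print(naive_similarity_join(r, s, eps=0.5))
```

The quadratic number of distance computations in this baseline is the cost that the abstract says GESS aims to reduce.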
Jens-Peter Dittrich, Bernhard Seeger
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2001
Where KDD
Authors Jens-Peter Dittrich, Bernhard Seeger