Sciweavers

CORR
2011
Springer

Similarity Join Size Estimation using Locality Sensitive Hashing

Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It is a generalization of the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling-based algorithm that uses Locality-Sensitive Hashing (LSH). The proposed algorithm, LSH-SS, uses an LSH index to enable effective sampling even at high thresholds. Using real-world data sets, we compare the proposed technique with random sampling and with the state-of-the-art technique for SSJ (adapted to VSJ), and demonstrate that LSH-SS offers more accurate estimates across the whole similarity threshold range, with small variance.
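To make the problem statement concrete, here is a minimal Python sketch of what "join size" means and of the plain random-sampling baseline the abstract mentions. It is not the LSH-SS algorithm, whose details the abstract does not give. The use of cosine similarity, the self-join setting, and all function names (`cosine`, `exact_join_size`, `sampled_join_size`) are assumptions made for illustration only.

```python
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def exact_join_size(vectors, tau):
    """Count unordered pairs with similarity >= tau by checking every pair."""
    return sum(1 for u, v in combinations(vectors, 2) if cosine(u, v) >= tau)


def sampled_join_size(vectors, tau, num_samples, rng):
    """Estimate the join size from uniformly sampled pairs (baseline, not LSH-SS)."""
    n = len(vectors)
    total_pairs = n * (n - 1) // 2
    hits = 0
    for _ in range(num_samples):
        i, j = rng.sample(range(n), 2)  # two distinct indices, uniform over pairs
        if cosine(vectors[i], vectors[j]) >= tau:
            hits += 1
    # Scale the observed hit rate up to the total number of pairs.
    return total_pairs * hits / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(8)] for _ in range(200)]
    for tau in (0.5, 0.8, 0.95):
        print(tau, exact_join_size(data, tau), sampled_join_size(data, tau, 500, rng))
```

At high thresholds only a tiny fraction of pairs qualify, so a uniform sample often contains no qualifying pair at all and the estimate collapses to zero. This is the difficulty the abstract refers to when it says an LSH index is used to enable effective sampling even at high thresholds.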
Hongrae Lee, Raymond T. Ng, Kyuseok Shim
Added 13 May 2011
Updated 13 May 2011
Type Journal
Year 2011
Where CORR
Authors Hongrae Lee, Raymond T. Ng, Kyuseok Shim