Joining Massive High-Dimensional Datasets



We consider the problem of joining massive datasets. We propose two techniques for minimizing the disk I/O cost of join operations for both spatial and sequence data. Our techniques optimize the use of the available buffer space using a global view of the datasets. We build a boolean matrix on the pages of the given datasets using a lower-bounding distance predictor. The marked entries of this matrix represent candidate page pairs to be joined. Our first technique joins the marked pages iteratively. Our second technique clusters the marked entries into dense rectangular regions that have minimal perimeter and fit into the buffer. These clusters are then ordered so that the total number of common pages between consecutive clusters is maximal. The clusters are then read from disk and joined. Our experimental results on various real datasets show that our techniques are 2 to 86 times faster than the competing techniques for spatial datasets, and 13 to 133 times faster than the competing techniques for seq...
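The page-matrix idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes in-memory lists of points standing in for disk pages, a distance join with a threshold `eps`, and a minimum bounding rectangle (MBR) gap as the lower-bounding distance predictor. The names `mbr`, `mbr_lower_bound`, `build_page_matrix`, and `join_marked_pages` are hypothetical. Only the first technique (joining the marked pages one by one) is shown; the clustering and ordering of marked entries is not covered.

```python
import math
from itertools import product


def mbr(page):
    """Return the minimum bounding rectangle of a page as (lows, highs)."""
    dims = range(len(page[0]))
    lows = [min(point[d] for point in page) for d in dims]
    highs = [max(point[d] for point in page) for d in dims]
    return lows, highs


def mbr_lower_bound(box_a, box_b):
    """Lower bound on the Euclidean distance between any two points
    drawn from the two rectangles (0.0 if the rectangles overlap)."""
    (a_lo, a_hi), (b_lo, b_hi) = box_a, box_b
    gaps = (
        max(0.0, al - bh, bl - ah)
        for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi)
    )
    return math.sqrt(sum(gap * gap for gap in gaps))


def build_page_matrix(pages_r, pages_s, eps):
    """Boolean matrix: entry [i][j] is marked when page i of R and
    page j of S may contain a pair of points within distance eps."""
    boxes_r = [mbr(page) for page in pages_r]
    boxes_s = [mbr(page) for page in pages_s]
    return [
        [mbr_lower_bound(br, bs) <= eps for bs in boxes_s]
        for br in boxes_r
    ]


def join_marked_pages(pages_r, pages_s, eps):
    """Join only the marked page pairs; unmarked pairs are never compared."""
    matrix = build_page_matrix(pages_r, pages_s, eps)
    result = []
    for i, row in enumerate(matrix):
        for j, marked in enumerate(row):
            if not marked:
                continue  # the lower bound already exceeds eps
            for p, q in product(pages_r[i], pages_s[j]):
                if math.dist(p, q) <= eps:
                    result.append((p, q))
    return matrix, result
```

Because the predictor never overestimates the true distance, skipping an unmarked entry cannot lose a qualifying pair; in a disk-based setting each skipped entry would correspond to a page pair that need not be read together.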

Tamer Kahveci, Christian A. Lang, Ambuj K. Singh


Database | ICDE 2003 | Sequence Datasets | Spatial Datasets | Various Real Datasets


» High-Dimensional Similarity Joins

» Visualisation of Distributions and Clusters Using ViSOMs on Gene Expression Data

» Using Category-Based Adherence to Cluster Market-Basket Data

Post Info

Added	01 Nov 2009
Updated	01 Nov 2009
Type	Conference
Year	2003
Where	ICDE
Authors	Tamer Kahveci, Christian A. Lang, Ambuj K. Singh
