
SIGMOD
2004
ACM

Online Maintenance of Very Large Random Samples

Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorith...
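For orientation, the single-pass setting described above can be illustrated with classic reservoir sampling (often called Algorithm R). This is a minimal in-memory sketch and is not the paper's disk-based method; the function name `reservoir_sample` and its parameters are hypothetical and chosen only for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of up to k items from a stream, in one pass.

    Classic in-memory reservoir sampling, shown only as an illustration.
    It does not address the disk-based, very large samples the abstract targets.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 replaces a random slot with probability k/(i+1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Draw 10 items from a stream of 1,000,000 integers in a single pass.
    print(reservoir_sample(range(1_000_000), 10, random.Random(42)))
```

When the reservoir is held in memory, each replacement is a cheap array write. Once the sample itself is gigabytes or terabytes in size and lives on disk, each replacement becomes a random disk write, which is the kind of cost the abstract's online, disk-based algorithms are concerned with.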
Chris Jermaine, Abhijit Pol, Subramanian Arumugam
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2004
Where SIGMOD