Online Maintenance of Very Large Random Samples

Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorith...
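For orientation, the single-pass setting described in the abstract can be illustrated with the classic in-memory reservoir technique (Algorithm R). This is a minimal sketch of that textbook baseline, not the disk-based algorithms the paper presents; the function name `reservoir_sample` and its parameters are assumptions made for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream, in one pass.

    Hypothetical helper for illustration: the textbook in-memory reservoir
    (Algorithm R), not the on-disk method from the paper.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1);
            # if kept, it overwrites a uniformly chosen slot.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 records from a stream of 1,000 without storing the stream.
sample = reservoir_sample(range(1000), 5, random.Random(42))
```

Each accepted item overwrites a randomly chosen slot. That is cheap in memory, but if the reservoir were a file of gigabytes or terabytes, each overwrite would likely become a random disk write, which is one way to see why a large on-disk sample is harder to maintain than a small in-memory one.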

Chris Jermaine, Abhijit Pol, Subramanian Arumugam

Data Management Perspective | Database | Fundamental Data Management | SIGMOD 2004 | Small Data Structure

Post Info

Added	08 Dec 2009
Updated	08 Dec 2009
Type	Conference
Year	2004
Where	SIGMOD
Authors	Chris Jermaine, Abhijit Pol, Subramanian Arumugam
