We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are k...
Data is increasingly being bought and sold online, and Webbased marketplace services have emerged to facilitate these activities. However, current mechanisms for pricing data are ...
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into ...
Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang,...
Shark is a research data analysis system built on a novel rained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unifie...
Cliff Engle, Antonio Lupher, Reynold Xin, Matei Za...
Locality-Sensitive Hashing (LSH) and its variants are wellknown methods for solving the c-approximate NN Search problem in high-dimensional space. Traditionally, several LSH funct...
Big data is the tar sands of the data world: vast reserves of raw gritty data whose valuable information content can only be extracted at great cost. MapReduce is a popular parall...
We present BloomUnit, a testing framework for distributed programs written in the Bloom language. BloomUnit allows developers to write declarative test specifications that descri...
Peter Alvaro, Andrew Hutchinson, Neil Conway, Will...
In this demo, we will present Tiresias, the first how-to query engine. How-to queries represent fundamental data analysis questions of the form: “How should the input change in...
The advent of affordable, shared-nothing computing systems portends a new class of parallel database management systems (DBMS) for on-line transaction processing (OLTP) applicatio...