Histograms are widely used in conventional databases and in data stream processing to summarize massive data distributions. Previous work on constructing histograms on data streams with provable guarantees has not taken into account the workload characteristics of databases, in which some parts of the distribution are used more frequently than others. Conversely, previous work on constructing histograms that does make use of workload characteristics, and that has demonstrated the significant advantage of exploiting workload information, has not come with provable guarantees on the accuracy of the histograms or on the time and space bounds needed to obtain reasonable accuracy. We study the algorithmic complexity of constructing workload-optimal histograms on data streams. We present an algorithm for constructing a nearly optimal histogram in nearly linear time and polylogarithmic space, in one pass. In the more general cash register model where data is streamed...
S. Muthukrishnan, Martin Strauss, X. Zheng