Many algorithms have been proposed to approximate holistic aggregates, such as quantiles and heavy hitters, over data streams. However, little work has been done to explore what techniques are required to incorporate these algorithms in a data stream query processor, and to make them useful in practice. In this paper, we study the performance implications of using user-defined aggregate functions (UDAFs) to incorporate selectionbased and sketch-based algorithms for holistic aggregates into a data stream management system's query processing architecture. We identify key performance bottlenecks and tradeoffs, and propose novel techniques to make these holistic UDAFs fast and spaceefficient for use in high-speed data stream applications. We evaluate performance using generated and actual IP packet data, focusing on approximating quantiles and heavy hitters. The best of our current implementations can process streaming queries at OC48 speeds (2x 2.4Gbps).
Graham Cormode, Theodore Johnson, Flip Korn, S. Mu