Mining massive temporal data streams for significant trends, emerging buzz, and unusually high or low activity is an important problem with several commercial applications. In this paper, we propose a framework based on relational records and metric spaces to study such problems. Our framework provides the necessary mathematical underpinnings for this genre of problems, and leads to efficient algorithms in the stream/sort model of massive data sets (where the algorithm makes passes over the data, computes a new stream on the fly, and is allowed to sort the intermediate data). Our algorithm makes novel use of metric approximations in the data stream context, and highlights the role of hierarchical organization of large data sets in designing efficient algorithms in the stream/sort model.
Sreenivas Gollapudi, D. Sivakumar