Most data mining operations have a search component at their core. For example, the performance of similarity search or Nearest Neighbor classification depends largely on the underlying compression and distance estimation techniques. As data repositories grow, there is a need not only to store the data in compressed form, but also to support mining operations directly on the compressed data. The tightness of the distances estimated on the compressed objects directly affects search performance. We motivate our work within the setting of search engine weblog repositories, where keyword demand trends over time are represented and stored as compressed time-series data. Search and analysis over such sequence data have important applications for search engines, including discovery of important news events, keyword recommendation, and efficient keyword-to-advertisement mapping. We present new mechanisms for ...
Michail Vlachos, Philip S. Yu, Suleyman S. Kozat