We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet...
Similarity search methods are widely used as kernels in various data mining and machine learning applications including those in computational biology, web search/clustering. Near...
The MapReduce framework is increasingly being used to analyze large volumes of data. One important type of data analysis done with MapReduce is log processing, in which a click-st...
Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Ju...
Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clust...
Classification is an important problem in data mining. Given a database of records, each with a class label, a classifier generates a concise and meaningful description for each c...