Sciweavers

KDD
2001
ACM

The "DGX" distribution for mining massive, skewed data

Skewed distributions appear very often in practice. Unfortunately, the traditional Zipf distribution often fails to model them well. In this paper, we propose a new probability distribution, the Discrete Gaussian Exponential (DGX), to achieve excellent fits in a wide variety of settings; our new distribution includes the Zipf distribution as a special case. We present a statistically sound method for estimating the DGX parameters based on maximum likelihood estimation (MLE). We applied DGX to a wide variety of real-world data sets, such as sales data from a large retailer chain, usage data from AT&T, and Internet clickstream data; in all cases, DGX fits these distributions very well, with almost a 99% correlation coefficient in quantile-quantile plots. Our algorithm also scales very well because it requires only a single pass over the data. Finally, we illustrate the power of DGX as a new tool for data mining tasks, such as outlier detection.
Keywords: DGX, Zipf's law, rank-fr...
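The abstract does not reproduce the DGX formula, so the following is only a minimal sketch under stated assumptions. It assumes the commonly cited form of DGX as a discretized lognormal, p(k) proportional to (1/k) * exp(-(ln k - mu)^2 / (2 sigma^2)) for k = 1, 2, ..., and it truncates the support at a hypothetical `k_max` so the normalizing constant can be computed by direct summation. The function names and the coarse grid search are illustrative stand-ins, not the estimation procedure from the paper.

```python
import math

def dgx_log_pmf(mu, sigma, k_max):
    """Log-probabilities of the assumed DGX form on k = 1..k_max.

    ASSUMPTION: p(k) is proportional to (1/k) * exp(-(ln k - mu)^2 / (2 sigma^2)),
    with the support truncated at k_max (a choice made for this sketch).
    """
    logw = [-math.log(k) - (math.log(k) - mu) ** 2 / (2.0 * sigma ** 2)
            for k in range(1, k_max + 1)]
    # Log-sum-exp for a numerically stable normalizing constant.
    m = max(logw)
    log_z = m + math.log(sum(math.exp(v - m) for v in logw))
    return [v - log_z for v in logw]

def log_likelihood(counts, mu, sigma, k_max):
    """Log-likelihood of a histogram {value k: number of occurrences}."""
    lp = dgx_log_pmf(mu, sigma, k_max)
    return sum(n * lp[k - 1] for k, n in counts.items())

def fit_grid(counts, k_max, mus, sigmas):
    """Pick the (mu, sigma) pair on a grid that maximizes the likelihood.

    A grid search is used only to keep the sketch dependency-free; a real
    fit would use a proper numerical optimizer.
    """
    best = None
    for mu in mus:
        for sigma in sigmas:
            ll = log_likelihood(counts, mu, sigma, k_max)
            if best is None or ll > best[0]:
                best = (ll, mu, sigma)
    return best[1], best[2]
```

In this sketch the likelihood depends on the data only through the histogram `counts`, which can be accumulated in one pass over the raw records; the fitting step then works on the histogram alone.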
Zhiqiang Bi, Christos Faloutsos, Flip Korn
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2001
Where KDD
Authors Zhiqiang Bi, Christos Faloutsos, Flip Korn