COMAD
2008

Disk-Based Sampling for Outlier Detection in High Dimensional Data

We propose an efficient sampling-based outlier detection method for large, high-dimensional data sets. The method consists of two phases. In the first phase, we combine a "sampling" strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan, and its running time is linear in both the size and the dimensionality of the data set. The second phase is an additional data scan that extracts the actual outliers from the candidate set. Its running time is O(CN), where C and N are the sizes of the candidate set and the data set, respectively. The major strengths of the proposed approach are that (1) no partitioning of the dimensions is required, which makes it particularly suitable for high-dimensional data, and (2) a small sampling set (0.5% of the original data set) can discover more than 99% of all the outliers identified by a full brute-force approach. We prese...
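The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not define the outlier score or the randomized partitioning step, so the sketch assumes a k-nearest-neighbour distance score and replaces the partitioning with a plain uniform random sample. All function and parameter names (`candidate_outliers`, `verify_outliers`, `sample_frac`, `num_candidates`) are hypothetical.

```python
import heapq
import math
import random


def knn_distance(point, others, k):
    """Distance from `point` to its k-th nearest neighbour among `others`."""
    dists = sorted(math.dist(point, q) for q in others)
    return dists[k - 1] if len(dists) >= k else math.inf


def candidate_outliers(data, sample_frac, k, num_candidates, seed=0):
    """Phase 1 (one scan): score every point against a small random sample.

    Assumption: a uniform sample stands in for the paper's sampling plus
    randomized partitioning, which the abstract does not spell out.
    """
    rng = random.Random(seed)
    sample_size = min(len(data), max(k + 1, int(len(data) * sample_frac)))
    sample_idx = rng.sample(range(len(data)), sample_size)
    scores = []
    for i, p in enumerate(data):
        # Exclude the point itself if it happens to be in the sample.
        others = [data[j] for j in sample_idx if j != i]
        scores.append((knn_distance(p, others, k), i))
    # Keep the points that look most isolated relative to the sample.
    return [i for _, i in heapq.nlargest(num_candidates, scores)]


def verify_outliers(data, candidates, k, num_outliers):
    """Phase 2 (second scan): exact score for candidates only, O(C * N)."""
    exact = []
    for i in candidates:
        others = [q for j, q in enumerate(data) if j != i]
        exact.append((knn_distance(data[i], others, k), i))
    return sorted(i for _, i in heapq.nlargest(num_outliers, exact))
```

In this sketch, phase 1 compares each of the N points only with the sample, so for a fixed sample size its cost grows linearly with N and with the dimensionality. Phase 2 compares each of the C candidates with all N points, which matches the O(CN) bound quoted above. Calling `candidate_outliers(data, 0.005, k, num_candidates)` followed by `verify_outliers(data, candidates, k, num_outliers)` mirrors the 0.5% sampling rate mentioned in the abstract, though the recall figures reported there apply to the authors' method, not to this simplification.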
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where COMAD
Authors Timothy de Vries, Sanjay Chawla, Pei Sun, Gia Vinh Anh Pham