Disk-Based Sampling for Outlier Detection in High Dimensional Data


We propose an efficient sampling-based outlier detection method for large, high-dimensional data sets. The method has two phases. In the first phase, we combine a sampling strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan, and its running time is linear in both the size and the dimensionality of the data set. The second phase is an additional data scan that extracts the actual outliers from the candidate set; its running time is O(CN), where C and N are the sizes of the candidate set and the data set, respectively. The major strengths of the proposed approach are that (1) no partitioning of the dimensions is required, which makes it particularly suitable for high-dimensional data, and (2) a small sampling set (0.5% of the original data set) can discover more than 99% of all the outliers identified by a full brute-force approach. We prese...
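The two-phase idea can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not define the outlier score or the randomized partitioning step, so the sketch assumes a distance-based score (distance to the k-th nearest neighbour) and replaces the partitioning with a plain uniform random sample. All function names, parameters, and the synthetic data are hypothetical. Phase 1 scores every point against the small sample in one pass and keeps the C highest-scoring points as candidates; phase 2 rescores only those candidates against the full data, which costs O(CN) distance computations.

```python
import heapq
import math
import random


def kth_nn_distance(point, others, k):
    """Distance from `point` to its k-th nearest neighbour in `others`."""
    # Skip the point itself (by identity) if it happens to be in `others`.
    dists = sorted(math.dist(point, q) for q in others if q is not point)
    return dists[min(k, len(dists)) - 1]


def candidate_outliers(data, sample_frac, num_candidates, k, rng):
    """Phase 1 (one scan): score every point against a small random sample.

    Assumption: a uniform sample stands in for the paper's sampling plus
    randomized partitioning, which the abstract does not specify.
    """
    sample_size = min(len(data), max(k + 1, int(sample_frac * len(data))))
    sample = rng.sample(data, sample_size)
    scored = ((kth_nn_distance(p, sample, k), i) for i, p in enumerate(data))
    return [i for _, i in heapq.nlargest(num_candidates, scored)]


def exact_outliers(data, candidates, num_outliers, k):
    """Phase 2 (second scan): exact scores for the C candidates only."""
    scored = ((kth_nn_distance(data[i], data, k), i) for i in candidates)
    return sorted(i for _, i in heapq.nlargest(num_outliers, scored))


# Synthetic data: 200 clustered 5-dimensional points plus 3 planted outliers.
rng = random.Random(0)
data = [tuple(rng.gauss(0.0, 1.0) for _ in range(5)) for _ in range(200)]
data += [(50.0, 0, 0, 0, 0), (0, 60.0, 0, 0, 0), (0, 0, -70.0, 0, 0)]

candidates = candidate_outliers(data, sample_frac=0.1, num_candidates=20, k=3, rng=rng)
found = exact_outliers(data, candidates, num_outliers=3, k=3)
print(found)
```

Because a point's k-th nearest-neighbour distance within a sample can only be greater than or equal to its distance within the full data, true outliers keep a high score in phase 1, which is why a small sample can still yield a useful candidate set under this assumed score.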

Timothy de Vries, Sanjay Chawla, Pei Sun, Gia Vinh Anh Pham

Candidate Set | COMAD 2008 | Data Scan | Data Sets | Knowledge Management

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	COMAD
Authors	Timothy de Vries, Sanjay Chawla, Pei Sun, Gia Vinh Anh Pham
