Mining distance-based outliers in near linear time with randomization and a simple pruning rule



Defining outliers by their distance to neighboring examples is a popular approach to finding unusual examples in a data set. Recently, much work has been conducted with the goal of finding fast algorithms for this task. We show that a simple nested loop algorithm that in the worst case is quadratic can give near linear time performance when the data is in random order and a simple pruning rule is used. We test our algorithm on real high-dimensional data sets with millions of examples and show that the near linear scaling holds over several orders of magnitude. Our average case analysis suggests that much of the efficiency is because the time to process non-outliers, which are the majority of examples, does not depend on the size of the data set.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - data mining

Keywords: Outliers, distance-based operations, anomaly detection, disk-based algorithms
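The abstract's combination of a nested loop, random ordering, and a pruning rule can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring function or the exact pruning test, so the sketch assumes an outlier's score is the distance to its k-th nearest neighbor, and it prunes an example as soon as that running score falls below the weakest score among the current top n outliers. The names `top_outliers`, `k`, `n`, and `seed` are hypothetical.

```python
import heapq
import math
import random


def top_outliers(points, k=2, n=1, seed=0):
    """Return up to n (point, score) pairs, strongest outlier first.

    Assumption: score = distance to the k-th nearest neighbor.
    """
    rng = random.Random(seed)
    data = list(points)
    rng.shuffle(data)  # random order, so close neighbors tend to turn up early

    top = []      # min-heap of (score, index) holding the best n outliers so far
    cutoff = 0.0  # weakest score in `top` once it holds n entries

    for i, x in enumerate(data):
        nearest = []  # max-heap (negated) of the k smallest distances seen so far
        pruned = False
        for j, y in enumerate(data):
            if i == j:
                continue
            d = math.dist(x, y)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: the running k-th nearest distance only shrinks as
            # more neighbors are seen, so once it drops below the cutoff this
            # example can no longer be a top-n outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return [(data[i], s) for s, i in sorted(top, reverse=True)]


cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
result = top_outliers(cluster + [(10.0, 10.0)], k=2, n=1)
```

In this sketch the inner loop for a typical non-outlier stops after it has seen k sufficiently close neighbors, which is the intuition behind the abstract's remark that the cost of processing non-outliers does not depend on the size of the data set. The worst case remains quadratic, as the abstract states.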

Stephen D. Bay, Mark Schwabacher


Data Mining | KDD 2003 | Mining Keywords Outliers | Nested Loop Algorithm | Simple Pruning Rule |


Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2003
Where	KDD
Authors	Stephen D. Bay, Mark Schwabacher
