Optimal outlier removal in high-dimensional

16 years 8 months ago


We study the problem of finding an outlier-free subset of a set of points (or a probability distribution) in n-dimensional Euclidean space. As in [BFKV 99], a point x is defined to be a β-outlier if there exists some direction w in which its squared distance from the mean along w is greater than β times the average squared distance from the mean along w. Our main theorem is that for any ε > 0, there exists a (1 - ε) fraction of the original distribution that has no O((n/ε)(b + log(n/ε)))-outliers, improving on the previous bound of O(n^7 b/ε). This is asymptotically the best possible, as shown by a matching lower bound. The theorem is constructive, and results in a 1/(1 - ε) approximation to the following optimization problem: given a distribution μ (i.e. the ability to sample from it), and a parameter ε > 0, find the minimum β for which there exists a subset of probability at least (1 - ε) with no β-outliers.
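The outlier definition above can be checked directly on a finite point set. The following is a minimal sketch of that check only, not the removal algorithm from the paper. It assumes the second-moment matrix M of the centered points is nonsingular, in which case the largest ratio over all directions w of the squared distance of x from the mean along w to the average squared distance along w equals (x - mean)^T M^{-1} (x - mean). The function names (`outlier_scores`, `has_beta_outlier`, and the helpers) are hypothetical and chosen for illustration.

```python
def mean_vector(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n, d = len(points), len(points[0])
    return [sum(p[j] for p in points) / n for j in range(d)]


def second_moment(points, mu):
    """Average of (y - mu)(y - mu)^T over the points."""
    n, d = len(points), len(mu)
    m = [[0.0] * d for _ in range(d)]
    for p in points:
        c = [p[j] - mu[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                m[i][j] += c[i] * c[j] / n
    return m


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Assumption of this sketch: M must be invertible.
            raise ValueError("second-moment matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            f = aug[r][col] / aug[col][col]
            for k in range(col, d + 1):
                aug[r][k] -= f * aug[col][k]
    z = [0.0] * d
    for i in reversed(range(d)):
        s = aug[i][d] - sum(aug[i][k] * z[k] for k in range(i + 1, d))
        z[i] = s / aug[i][i]
    return z


def outlier_scores(points):
    """For each point x, return (x - mean)^T M^{-1} (x - mean).

    This is the worst-case ratio over directions w, so x is a
    beta-outlier exactly when its score exceeds beta.
    """
    mu = mean_vector(points)
    m = second_moment(points, mu)
    scores = []
    for p in points:
        c = [p[j] - mu[j] for j in range(len(mu))]
        z = solve(m, c)
        scores.append(sum(ci * zi for ci, zi in zip(c, z)))
    return scores


def has_beta_outlier(points, beta):
    """True if some point in the set is a beta-outlier."""
    return any(s > beta for s in outlier_scores(points))
```

For example, for the four points (1, 0), (-1, 0), (0, 1), (0, -1) every score is 2, so the set has a 1.5-outlier but no 2.5-outlier. In general the scores average to the dimension n whenever M is nonsingular, so no point set of this kind is free of β-outliers for β below n.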

John Dunagan, Santosh Vempala
