Complete Cross-Validation for Nearest Neighbor Classifiers

16 years 7 months ago

Download www-2.cs.cmu.edu

Cross-validation is an established technique for estimating the accuracy of a classifier and is normally performed either using a number of random test/train partitions of the data, or using kfold cross-validation. We present a technique for calculating the complete cross-validation for nearest-neighbor classifiers: i.e., averaging over all desired test/train partitions of data. This technique is applied to several common classifier variants such as K-nearest-neighbor, stratified data partitioning and arbitrary loss functions. We demonstrate, with complexity analysis and experimental timing results, that the technique can be performed in time comparable to k-fold cross-validation, though in effect it averages an exponential number of trials. We show that the results of complete cross-validation are biased equally compared to subsampling and kfold cross-validation, and there is some reduction in variance. This algorithm offers significant benefits both in terms of time and accuracy.

Matthew D. Mullin, Rahul Sukthankar

Real-time Traffic