Magical thinking in data mining: lessons from CoIL challenge 2000

CoIL challenge 2000 was a supervised learning contest that attracted 43 entries. The authors of 29 entries later wrote explanations of their work. This paper discusses these reports and reaches three main conclusions. First, naive Bayesian classifiers remain competitive in practice: they were used by both the winning entry and the next best entry. Second, identifying feature interactions correctly is important for maximizing predictive accuracy: this was the difference between the winning classifier and all others. Third and most important, too many researchers and practitioners in data mining do not appreciate properly the issue of statistical significance and the danger of overfitting. Given a dataset such as the one for the CoIL contest, it is pointless to apply a very complicated learning algorithm, or to perform a very time-consuming model search. In either case, one is likely to overfit the training data and to fool oneself in estimating predictive accuracy and in discovering us...
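The abstract's third point, that an extensive model search can lead one to overestimate predictive accuracy, can be illustrated with a small simulation. The sketch below is not taken from the paper and uses no CoIL data: the function name `best_of_many`, the accuracy of 0.70, the test-set size of 200, and the count of 500 candidate models are all hypothetical choices made for illustration. Every simulated model has exactly the same true accuracy, so any advantage shown by the "best" one is due to chance alone.

```python
import random


def best_of_many(n_models, n_test, true_acc, seed=0):
    """Return the highest observed test accuracy among n_models candidates.

    Every candidate has the same true accuracy (true_acc), so differences
    in observed accuracy come only from sampling noise on the finite test set.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_models):
        # Each test example is classified correctly with probability true_acc.
        correct = sum(rng.random() < true_acc for _ in range(n_test))
        best = max(best, correct / n_test)
    return best


TRUE_ACC = 0.70  # hypothetical true accuracy shared by all candidates
N_TEST = 200     # hypothetical test-set size

# Observed accuracy of a single model evaluated once.
single = best_of_many(1, N_TEST, TRUE_ACC)

# Observed accuracy of the best model out of 500 equally good candidates.
selected = best_of_many(500, N_TEST, TRUE_ACC)

print(f"true accuracy:              {TRUE_ACC:.3f}")
print(f"one model, observed:        {single:.3f}")
print(f"best of 500 models, observed: {selected:.3f}")
```

Under these assumptions, the score of the selected model overstates its true accuracy even though it is no better than any other candidate. Evaluating the chosen model on data that played no part in the selection is one common way to obtain an honest estimate.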

Charles Elkan

Data Mining | KDD 2001 | Naive Bayesian Classifiers | Predictive Accuracy | Supervised Learning Contest

Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2001
Where	KDD
Authors	Charles Elkan
