Finding and removing outliers is an important problem in data mining. Errors in large databases can be extremely common, so an important property of a data mining algorithm is robustness with respect to errors in the database. Most sophisticated methods in machine learning address this problem to some extent, but not fully, and can be improved by addressing the problem more directly. In this paper we examine C4.5, a decision tree algorithm that is already quite robust - few algorithms have been shown to consistently achieve higher accuracy. C4.5 incorporates a pruning scheme that partially addresses the outlier removal problem. In our ROBUST-C4.5 algorithm we extend the pruning method to fully remove the effect of outliers, and this results in improvement on many databases.
George H. John