Sciweavers

CASDMKM
2004
Springer

Data Set Balancing

This paper reports experiments on three skewed data sets, aiming to demonstrate the problems that arise when skewed data is used and to identify the counter-problems that arise when the data is balanced. Three basic data mining approaches are considered (decision tree, regression-based, and neural network models), using both categorical and continuous data. Two of the data sets have binary outcomes, while the third has four possible outcomes. The key findings are as follows. When the data is highly unbalanced, algorithms tend to degenerate by assigning all cases to the most common outcome. When the data is balanced, accuracy rates tend to decline. Balancing also reduces the training set size, which can lead to a second form of degeneracy: the model fails on cases in the test set whose counterparts were omitted from the training set. Decision tree algorithms were found to be the most robust with respect to the degree of balancing applied.
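The trade-off described in the abstract can be illustrated with a minimal sketch. The abstract does not say which balancing method the paper uses; random undersampling is assumed here because the abstract notes that balancing shrinks the training set. The function names, the 950/50 class split, and the "ok"/"fraud" labels are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def undersample(rows, label_of, seed=0):
    """Balance classes by randomly downsampling each one to the rarest class size.

    Assumption: random undersampling stands in for whatever balancing
    procedure the paper actually applies.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


def majority_baseline_accuracy(labels):
    """Accuracy of a degenerate model that assigns every case to the most common outcome."""
    labels = list(labels)
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical skewed data set: 950 "ok" cases and 50 "fraud" cases.
rows = [(i, "ok") for i in range(950)] + [(i, "fraud") for i in range(950, 1000)]
balanced = undersample(rows, label_of=lambda r: r[1])

skewed_baseline = majority_baseline_accuracy(label for _, label in rows)
balanced_baseline = majority_baseline_accuracy(label for _, label in balanced)

print(len(rows), skewed_baseline)
print(len(balanced), balanced_baseline)
```

On the skewed set, the degenerate "always predict the majority" model already looks highly accurate, which is why an algorithm can collapse to it. After undersampling, that shortcut no longer pays off, but the training set is a fraction of its original size, so cases that appear in the test set may have no counterpart left in training.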
David L. Olson
Type Conference
Year 2004
Where CASDMKM
Authors David L. Olson