CCIA
2008
Springer

On the Dimensions of Data Complexity through Synthetic Data Sets

Abstract. This paper deals with the characterization of data complexity and the relationship with the classification accuracy. We study three dimensions of data complexity: the length of the class boundary, the number of features, and the number of instances of the data set. We find that the length of the class boundary is the most relevant dimension of complexity, since it can be used as an estimate of the maximum achievable accuracy rate of a classifier. The number of attributes and the number of instances do not affect classifier accuracy by themselves, if the boundary length is kept constant. The study emphasizes the use of measures revealing the intrinsic structure of data and recommends their use to extract conclusions on classifier behavior and their relative performance in multiple comparison experiments.

Keywords. Data complexity, Classification, Dimensionality, Synthetic data sets
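
To make the notion of "length of the class boundary" concrete, here is a minimal sketch of one common way such a quantity is estimated: build a minimum spanning tree over all points and report the fraction of points that touch an edge joining two different classes. This listing does not say which estimator the paper uses, so the approach, the function name boundary_fraction, and the toy data are assumptions made purely for illustration.

```python
import math


def boundary_fraction(points, labels):
    """Fraction of points touching an MST edge that joins two different classes.

    Illustrative estimator only; it is an assumption that a measure of this
    kind stands in for the boundary length discussed in the abstract.
    """
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from each point to the tree
    best_from = [-1] * n        # tree vertex that provides that link
    best_dist[0] = 0.0
    on_boundary = set()
    for _ in range(n):
        # Prim's algorithm: add the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]),
                key=lambda i: best_dist[i])
        in_tree[u] = True
        v = best_from[u]
        if v >= 0 and labels[u] != labels[v]:
            on_boundary.update((u, v))
        # Relax the links of the remaining points against the new tree vertex.
        for w in range(n):
            if not in_tree[w]:
                d = math.dist(points[u], points[w])
                if d < best_dist[w]:
                    best_dist[w] = d
                    best_from[w] = u
    return len(on_boundary) / n


# Hypothetical toy data: two well-separated clusters versus interleaved classes.
separated = boundary_fraction(
    [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)],
    [0, 0, 0, 1, 1, 1],
)
interleaved = boundary_fraction([(0,), (1,), (2,), (3,)], [0, 1, 0, 1])
print(separated, interleaved)
```

A low value suggests compact, well-separated classes, while a value near 1 suggests heavily interleaved classes. The sketch uses an O(n^2) version of Prim's algorithm, which is adequate for small synthetic sets but not tuned for large ones.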
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where CCIA
Authors Núria Macià, Ester Bernadó-Mansilla, Albert Orriols-Puig