Learning from imbalanced datasets presents a complex problem from both the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely, as with fraud, instances of disease, and regions of interest in large-scale simulations, the misclassification of rare events carries a correspondingly high cost. Under such circumstances, the dataset is often resampled to generate models with high minority-class accuracy. However, sampling methods face a common but important criticism: how can the amount and type of sampling be discovered automatically? To address this problem, we propose a wrapper paradigm that discovers the amount of resampling for a dataset by optimizing evaluation functions such as the f-measure, Area Under the ROC Curve (AUROC), cost, cost curves, and the cost-dependent f-measure. Our analysis of the wrapper is two-fold. First, we report the interaction between different evaluation and wrapper optimization func...
Nitesh V. Chawla, David A. Cieslak, Lawrence O. Ha
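The wrapper idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes only random undersampling of the majority class, a single held-out validation split, a toy one-dimensional threshold classifier, and the f-measure as the sole evaluation function. All function names (`f_measure`, `undersample`, `fit_threshold`, `wrapper_search`) are hypothetical and chosen for illustration.

```python
import random


def f_measure(y_true, y_pred, positive=1):
    """F-measure (harmonic mean of precision and recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def undersample(X, y, keep_fraction, rng, majority=0):
    """Keep every minority example and a random keep_fraction of the majority."""
    kept = [(x, label) for x, label in zip(X, y)
            if label != majority or rng.random() < keep_fraction]
    return [x for x, _ in kept], [label for _, label in kept]


def fit_threshold(X, y):
    """Toy classifier (an assumption): pick the cut-off with best training accuracy.

    An example is predicted as the minority class (1) when x >= threshold.
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(X)):
        acc = sum((x >= t) == (label == 1) for x, label in zip(X, y)) / len(y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def wrapper_search(X_train, y_train, X_val, y_val, fractions, seed=0):
    """Try each undersampling level and keep the one with the best validation f-measure."""
    rng = random.Random(seed)
    scores = {}
    for frac in fractions:
        Xs, ys = undersample(X_train, y_train, frac, rng)
        t = fit_threshold(Xs, ys)
        preds = [1 if x >= t else 0 for x in X_val]
        scores[frac] = f_measure(y_val, preds)
    best = max(scores, key=scores.get)
    return best, scores
```

In this sketch the search is a plain grid over candidate majority-class retention fractions; swapping `f_measure` for another evaluation function (for example a cost-based score) changes what the wrapper optimizes without altering the search loop.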