In this paper we describe two related approaches to estimating the sample sizes required to statistically compare the performance of two classifiers: acceptable failure rates (AFR) and the area under the receiver operating characteristic (ROC) curve (AUC). In particular, we consider rare event detection problems, where the prior class probabilities are highly skewed, and measure performance both at a specific operating point and over the whole ROC curve. We show that AUC is preferable to AFR as a performance measure because it requires a smaller data set to demonstrate the superiority of one classifier over another.
Andrew P. Bradley, I. Dennis Longstaff