Chuck P. Lam, David G. Stork

Often the most expensive and time-consuming task in building a pattern recognition system is collecting and accurately labeling training and testing data. In this paper, we explore the use of inexpensive noisy testing data for evaluating a classifier's performance. We assume 1) the (human) labeler provides category labels with a known mislabeling rate and 2) the trained classifier and the labeler are statistically independent. We then derive the number of "noisy" test samples that are, on average, equivalent to a single perfectly labeled test sample for the task of evaluating the classifier's performance. For practical and realistic error and mislabeling rates, this number of equivalent test patterns can be surprisingly low. We also derive an upper and lower bound for the true error rate when the labeler and the classifier are not independent.
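As a sketch of the kind of relation underlying this equivalence (in our own notation, not necessarily the paper's exact derivation): let $p$ denote the classifier's true error rate and $\alpha < 1/2$ the labeler's known mislabeling rate. Under the independence assumption, the classifier's output disagrees with a noisy label with probability
$$p' \;=\; p(1-\alpha) + (1-p)\alpha \;=\; \alpha + (1-2\alpha)\,p,$$
so $\hat{p} = (\hat{p}' - \alpha)/(1-2\alpha)$ is an unbiased estimate of $p$ from the observed disagreement rate $\hat{p}'$. Its variance from $n$ noisy test samples is $p'(1-p')/\bigl(n(1-2\alpha)^2\bigr)$, while $m$ perfectly labeled samples give variance $p(1-p)/m$; equating the two yields the equivalence ratio
$$\frac{n}{m} \;=\; \frac{p'(1-p')}{p(1-p)\,(1-2\alpha)^2}.$$
For example, with $p = 0.10$ and $\alpha = 0.05$ we get $p' = 0.14$ and $n/m \approx 1.65$, i.e. fewer than two noisy samples per perfectly labeled one, illustrating why the equivalent number of test patterns can be surprisingly low.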