This paper discusses a potential methodological problem with empirical studies that assess project effort prediction systems. Frequently a hold-out strategy is deployed, so that the data set is split into a training set and a validation set. Inferences are then made concerning the relative accuracy of the different prediction techniques under examination. Typically this is done using very small numbers of sampled training sets. We show that such studies can lead to almost random results, particularly where relatively small effects are being studied. To illustrate this problem, we analyse two data sets, using a configuration problem for case-based prediction, and generate results from 100 different training sets. This enables us to produce results with quantified confidence limits. From this we conclude that, in both cases, using fewer than five training sets leads to untrustworthy results, and ideally more than 20 sets should be deployed. Unfortunately, this poses something of a question over a number of empirical studies.
Colin Kirsopp, Martin J. Shepperd
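The following Python sketch (not the authors' code) illustrates the effect described in the abstract: it repeats random hold-out splits many times and reports how the spread of accuracy estimates shrinks as the number of training sets grows. The synthetic data set, the nearest-neighbour predictor and the MMRE accuracy measure are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "project" data: features and effort (purely illustrative).
n_projects = 80
X = rng.uniform(0, 1, size=(n_projects, 3))
effort = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 10, n_projects) + 60

def knn_predict(X_train, y_train, X_test, k=2):
    """Simple case-based (k-nearest-neighbour) effort prediction."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argsort(d)[:k]].mean())
    return np.array(preds)

def mmre(actual, predicted):
    """Mean magnitude of relative error, a common accuracy measure."""
    return np.mean(np.abs(actual - predicted) / actual)

def holdout_mmre(n_splits, test_fraction=0.33):
    """MMRE scores from n_splits independent random hold-out splits."""
    scores = []
    for _ in range(n_splits):
        idx = rng.permutation(n_projects)
        n_test = int(test_fraction * n_projects)
        test, train = idx[:n_test], idx[n_test:]
        preds = knn_predict(X[train], effort[train], X[test])
        scores.append(mmre(effort[test], preds))
    return np.array(scores)

# Compare the stability of conclusions drawn from few vs. many training sets.
for n_splits in (1, 5, 20, 100):
    scores = holdout_mmre(n_splits)
    # 95% confidence half-width on the mean (normal approximation);
    # it is undefined for a single split, which is precisely the problem.
    if n_splits > 1:
        half_width = 1.96 * scores.std(ddof=1) / np.sqrt(n_splits)
    else:
        half_width = float("nan")
    print(f"{n_splits:3d} splits: mean MMRE = {scores.mean():.3f} ± {half_width:.3f}")
```

Running the sketch shows wide, unquantifiable variation with one or a handful of splits and a progressively tighter confidence interval as the number of sampled training sets increases, mirroring the paper's recommendation of using more than 20 sets.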