In many data mining problems, the definition of which structures in the database are to be regarded as interesting or valuable is given only loosely. Typically this is regarded as a source of ambiguity and imprecision. We propose instead to take advantage of this looseness by choosing the particular definition that optimises some additional criterion. We illustrate the idea using a consumer credit data set, in which the definition of what constitutes a bad-risk customer is somewhat arbitrary. Instead of adopting the common strategy of freely choosing some definition, we choose the one that optimises predictability. That is, among the admissible class definitions, we choose the one whose classes can be predicted most accurately.
Mark G. Kelly, David J. Hand, Niall M. Adams
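
The selection idea described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' procedure: it assumes that the candidate definitions of a "bad" customer form a family of thresholds on a hypothetical outcome variable (months in arrears), that a risk score is already available as the predictor, and that predictability is measured by the area under the ROC curve (AUC). The abstract specifies none of these, and the names `auc`, `best_definition`, `risk_scores` and `months_in_arrears` are invented for the example.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen 'bad' (label 1)
    customer has a higher score than a randomly chosen 'good' (label 0)
    customer, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_definition(scores, outcome, candidate_thresholds):
    """Label a customer 'bad' when outcome >= t for each candidate t, and
    return the threshold whose labels the scores predict best, together
    with the AUC obtained under every usable definition."""
    results = {}
    for t in candidate_thresholds:
        labels = [1 if o >= t else 0 for o in outcome]
        # A definition that puts every customer in one class cannot be assessed.
        if 0 < sum(labels) < len(labels):
            results[t] = auc(scores, labels)
    if not results:
        raise ValueError("no candidate definition yields two non-empty classes")
    best = max(results, key=results.get)
    return best, results


# Made-up toy data (hypothetical): higher score means higher assumed risk.
risk_scores = [1, 2, 3, 4, 5, 6]
months_in_arrears = [0, 1, 0, 2, 3, 2]

best, results = best_definition(risk_scores, months_in_arrears, [1, 2, 3, 4])
print(best, results)
```

In this toy data the definition "two or more months in arrears" separates the customers perfectly by score, so it is the one selected. In practice the predictor would be fitted, and its performance estimated, on data held out from the comparison, so that the chosen definition is not simply the one that overfits most.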