Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble s...
Steven Eschrich, Nitesh V. Chawla, Lawrence O. Hal
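
The subsampling ensemble outlined in the abstract can be illustrated with a minimal sketch. The excerpt does not name the base learner or the vote-combination rule, so the nearest-centroid learner, the simple majority vote, and all function and parameter names below (`train_subsample_ensemble`, `n_members`, `subsample_size`) are assumptions made for illustration rather than the authors' method.

```python
import random
from collections import Counter


def train_centroid(X, y):
    """Assumed base learner: store one mean feature vector per class."""
    sums, counts = {}, Counter(y)
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict_centroid(model, x):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(model[c], x))
    return min(sorted(model), key=dist)


def train_subsample_ensemble(X, y, n_members=25, subsample_size=10, seed=0):
    """Train each member on a random subsample drawn without replacement.

    Shrinking subsample_size makes every member see less of the data,
    which is the knob the abstract describes reducing.
    """
    rng = random.Random(seed)
    k = min(subsample_size, len(X))
    members = []
    for _ in range(n_members):
        chosen = rng.sample(range(len(X)), k)
        members.append(
            train_centroid([X[i] for i in chosen], [y[i] for i in chosen])
        )
    return members


def predict_ensemble(members, x):
    """Combine member predictions by simple majority vote (an assumption)."""
    votes = Counter(predict_centroid(m, x) for m in members)
    return votes.most_common(1)[0][0]
```

Because every member is trained on only `subsample_size` examples, training cost per member is independent of the full dataset size, which is one way to read the claim that the approach scales well.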