Generalization Methods in Bioinformatics


Download www.cs.rpi.edu

Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble s...
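The subsampling ensemble idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid base learner, sampling without replacement, unweighted majority voting, and every function and parameter name below are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter


def train_centroid(samples):
    """Hypothetical base learner: one mean feature vector per class."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict_centroid(model, features):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))


def train_subsample_ensemble(data, n_members, subsample_size, seed=0):
    """Train each member on its own random subsample of the data.

    Assumption: subsamples are drawn without replacement, so
    subsample_size must not exceed len(data).
    """
    rng = random.Random(seed)
    return [train_centroid(rng.sample(data, subsample_size)) for _ in range(n_members)]


def predict_ensemble(ensemble, features):
    """Combine member predictions by unweighted majority vote (an assumption)."""
    votes = Counter(predict_centroid(member, features) for member in ensemble)
    return votes.most_common(1)[0][0]
```

In this sketch, `subsample_size` is the knob the abstract refers to when it mentions reducing the individual subsample size: smaller values give each member less of the training set to fit, while `n_members` controls how many such members vote.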

Steven Eschrich, Nitesh V. Chawla, Lawrence O. Hall


Data Mining | Drug Discovery Prediction | KDD 2002 | Protein Secondary Structure | Secondary Structure Prediction |


Post Info

Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2002
Where	KDD
Authors	Steven Eschrich, Nitesh V. Chawla, Lawrence O. Hall
