Genetic programming (GP) based data fusion and AdaBoost can both improve in vitro prediction of Cytochrome P450 activity by combining artificial neural networks (ANN). Pharmaceutical drug design data provided by high throughput screening (HTS) is used to train many base ANN classifiers. In data mining (KDD) we must avoid over fitting. The ensembles do extrapolate from the training data to other unseen molecules. I.e. they predict inhibition of a P450 enzyme by compounds unlike the chemicals used to train them. Thus the models might provide in silico screens of virtual chemicals as well as physical ones from Glaxo SmithKline (GSK)’s cheminformatics database. The receiver operating characteristics (ROC) of boosted and evolved ensemble are given.
William B. Langdon, S. J. Barrett, Bernard F. Buxt