We introduce a parallel approach, "DT-SELECT," for selecting features used by inductive learning algorithms to predict protein secondary structure. DT-SELECT is able to rapidly choose small, nonredundant feature sets from pools containing hundreds of thousands of potentially useful features. It does this by building a decision tree, using features from the pool, that classifies a set of training examples. The features included in the tree provide a compact description of the training data and are thus suitable for use as inputs to other inductive learning algorithms. Empirical experiments on the protein secondary-structure task, in which sets of complex features chosen by DT-SELECT are used to augment a standard artificial neural network representation, yield surprisingly little performance gain, even though features are selected from very large feature pools. We discuss some possible reasons for this result.
Kevin J. Cherkauer, Jude W. Shavlik
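
The selection idea summarized in the abstract (grow a decision tree over a feature pool, then keep only the features the tree actually tests) can be illustrated with a minimal sketch. This is not the DT-SELECT implementation: it assumes Boolean features, an ID3-style information-gain splitting criterion, and a simple cap on the number of features, and it runs serially rather than in parallel. The function names, the `max_features` parameter, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one Boolean feature."""
    split = {True: [], False: []}
    for row, label in zip(rows, labels):
        split[bool(row[feature])].append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in split.values() if part)
    return entropy(labels) - remainder


def select_features(rows, labels, pool, max_features=10, selected=None):
    """Greedily grow a tree; return the features it tests, in order of first use."""
    if selected is None:
        selected = []
    # Stop at pure nodes, on an empty pool, or once the feature budget is spent.
    if not pool or len(set(labels)) <= 1 or len(selected) >= max_features:
        return selected
    best = max(pool, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 1e-12:
        return selected  # no feature in the pool separates these examples
    if best not in selected:
        selected.append(best)
    # Recurse into both branches; a positive gain guarantees both are smaller.
    for value in (True, False):
        subset = [(r, l) for r, l in zip(rows, labels) if bool(r[best]) == value]
        if subset:
            sub_rows, sub_labels = zip(*subset)
            select_features(list(sub_rows), list(sub_labels), pool,
                            max_features, selected)
    return selected


# Toy data (invented for illustration): f3 duplicates f1, f4 is constant.
rows = [
    {"f1": True,  "f2": True,  "f3": True,  "f4": True},
    {"f1": True,  "f2": False, "f3": True,  "f4": True},
    {"f1": False, "f2": True,  "f3": False, "f4": True},
    {"f1": False, "f2": False, "f3": False, "f4": True},
]
labels = ["H", "E", "C", "C"]  # helix, strand, coil
chosen = select_features(rows, labels, ["f1", "f2", "f3", "f4"])
print(chosen)  # prints ['f1', 'f2']
```

In this toy run the redundant feature `f3` and the uninformative feature `f4` never enter the tree, which is the sense in which a tree-derived feature set is small and nonredundant. The chosen features would then be supplied as additional inputs to a separate learner.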