We describe a novel approach for predicting the function of a protein from its amino-acid sequence. Given features that can be computed from the amino-acid sequence in a straightforward fashion (such as pI, molecular weight, and amino-acid composition), the technique allows us to answer questions such as: Is the protein an enzyme? If so, in which Enzyme Commission (EC) class does it belong? Our approach uses machine learning (ML) techniques to induce classifiers that predict the EC class of an enzyme from features extracted from its primary sequence. We report on a variety of experiments in which we explored the use of three different ML techniques in conjunction with training datasets derived from PDB and from SwissProt. We also explored the use of several different feature sets. Our method is able to predict the first EC number of an enzyme with 74% accuracy (thereby assigning the enzyme to one of six broad categories of enzyme function), and to predict the second EC number of an enzyme with 68...
Marie desJardins, Peter D. Karp, Markus Krummenack
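To illustrate the kind of sequence-derived features mentioned in the abstract, the following is a minimal sketch of how amino-acid composition and an approximate molecular weight might be computed from a primary sequence. It is not the authors' feature extractor: the function names (`composition`, `molecular_weight`, `feature_vector`) are hypothetical, the residue masses are approximate average values assumed for illustration, and pI is omitted because it requires pKa assumptions that the text does not specify.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors line up.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Approximate average residue masses in daltons (assumed values for this sketch).
AVG_RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01528  # one water is retained by the free chain termini


def _standard_residues(sequence):
    """Return the standard residues of a sequence, ignoring other symbols."""
    residues = [ch for ch in sequence.upper() if ch in AVG_RESIDUE_MASS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return residues


def composition(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    residues = _standard_residues(sequence)
    counts = Counter(residues)
    return {aa: counts[aa] / len(residues) for aa in AMINO_ACIDS}


def molecular_weight(sequence):
    """Approximate molecular weight: sum of residue masses plus one water."""
    residues = _standard_residues(sequence)
    return sum(AVG_RESIDUE_MASS[ch] for ch in residues) + WATER_MASS


def feature_vector(sequence):
    """Molecular weight followed by the 20 composition fractions."""
    comp = composition(sequence)
    return [molecular_weight(sequence)] + [comp[aa] for aa in AMINO_ACIDS]
```

A vector of this form, one per protein, could then be paired with a known EC class label and passed to any standard classifier; which learners and feature sets work best is the subject of the experiments reported in the paper.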