Phylogenetic profiles of proteins (strings of ones and zeros encoding, respectively, the presence and absence of proteins in a group of genomes) have recently been used to identify homologous proteins and/or functionally linked proteins, such as those participating in a metabolic pathway. We propose a novel learning method for protein classification based on phylogenetic profiles, which takes into account both the phylogenetic tree structure and the likelihood of protein presence in genomes. The method extends the profiles with extra bits encoding the phylogenetic tree, whose interior nodes, representing hypothetical ancestral genomes, are scored to reflect their chances of developing divergence in their descendants. The scoring scheme also incorporates the likelihood of protein presence in genomes as weighting factors, which are initially collected from the training data and integrated into the kernel of a support vector machine. In a trans...
Roger A. Craig, Li Liao
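
The idea of extending a profile with tree-derived bits and then comparing profiles through a weighted kernel can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy four-genome tree, the names `TREE`, `extend_profile` and `weighted_linear_kernel`, the choice to score each interior node as the mean of its two children (a simple stand-in for the divergence-based scoring described above), and the use of a weighted dot product as the kernel.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy phylogeny: interior node -> (left child, right child).
# Leaves g1..g4 are genomes; a1, a2 and root are hypothetical ancestors.
TREE: Dict[str, Tuple[str, str]] = {
    "a1": ("g1", "g2"),
    "a2": ("g3", "g4"),
    "root": ("a1", "a2"),
}
LEAVES: List[str] = ["g1", "g2", "g3", "g4"]
# Interior nodes listed so that children always come before their parent.
INTERIOR_POSTORDER: List[str] = ["a1", "a2", "root"]


def extend_profile(profile: Sequence[int]) -> List[float]:
    """Append one score per interior node to a 0/1 phylogenetic profile.

    Assumption: an interior node's score is the mean of its children's
    scores. The paper's actual scoring scheme may differ.
    """
    if len(profile) != len(LEAVES):
        raise ValueError("profile length must match the number of genomes")
    scores: Dict[str, float] = {g: float(b) for g, b in zip(LEAVES, profile)}
    for node in INTERIOR_POSTORDER:
        left, right = TREE[node]
        scores[node] = (scores[left] + scores[right]) / 2.0
    return [scores[n] for n in LEAVES + INTERIOR_POSTORDER]


def weighted_linear_kernel(
    x: Sequence[float], y: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted dot product; the weights stand in for the presence
    likelihoods that would be estimated from training data (assumption)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w * a * b for w, a, b in zip(weights, x, y))


if __name__ == "__main__":
    p = extend_profile([1, 0, 1, 1])
    q = extend_profile([1, 1, 0, 0])
    uniform = [1.0] * len(p)  # placeholder weights, not learned values
    print(p, q, weighted_linear_kernel(p, q, uniform))
```

In practice, a kernel of this kind could be evaluated over all pairs of training profiles and the resulting Gram matrix passed to a support vector machine that accepts precomputed kernels; the uniform weights above are placeholders for values estimated from the training set.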