Understanding the relationship between protein structure and its sequence is one of the most important tasks of current bioinformatics research. In this work, recurring protein sequence motifs are explored with a K-means clustering algorithm. No structural information is used during the clustering process so that the relationship between sequence similarity and structural similarity for sequence-based clusters can be studied. This work focuses on characterizing structural similarity so that the quality of sequence clusters can be assessed accurately. Analysis of results reveals that the combined metric of distance matrix root mean squared deviation for sequence cluster (dmRMSD_SC) and torsion angle RMSD_SC (taRMSD_SC) can provide the reliable indication of structural similarity for sequence clusters. Based on our combined metric, the recurrent sequence clusters with high structural similarity are used to generate sequence motifs. The common 3D structure of a sequence motif is represen...
Wei Zhong, Gulsah Altun, Robert W. Harrison, Phang