Sciweavers

BMCBI
2011

N-gram analysis of 970 microbial organisms reveals presence of biological language models

Background: It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as “signature-style” word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of ‘biological language modeling’, statistical n-gram analysis has been applied to the comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are abundant in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time, proteomes of only 44 organisms were available, which limited the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological lang...
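The idea of an n-gram "genome signature" can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the `ratio` and `floor` thresholds, and the toy sequences are all assumptions made for illustration. It counts overlapping amino acid n-grams per organism and reports those that are far more frequent in one organism than in any other.

```python
from collections import Counter


def ngram_frequencies(sequences, n):
    """Relative frequency of each overlapping n-gram across a list of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def signature_ngrams(proteomes, n, ratio=3.0, floor=0.05):
    """Return, per organism, the n-grams whose frequency is at least `ratio`
    times the highest frequency seen in any other organism.

    `ratio` and `floor` are hypothetical thresholds chosen for this sketch;
    `floor` keeps n-grams that are merely rare everywhere from qualifying.
    """
    freqs = {name: ngram_frequencies(seqs, n) for name, seqs in proteomes.items()}
    signatures = {}
    for name, own in freqs.items():
        hits = []
        for gram, f in own.items():
            max_other = max(
                (other.get(gram, 0.0) for o, other in freqs.items() if o != name),
                default=0.0,
            )
            if f >= ratio * max(max_other, floor):
                hits.append(gram)
        signatures[name] = sorted(hits)
    return signatures


# Toy input (made-up sequences, not real proteomes).
toy_proteomes = {
    "org_a": ["MKKKKLA", "KKKKAL"],
    "org_b": ["MALALAL", "LALAKK"],
}

if __name__ == "__main__":
    print(signature_ngrams(toy_proteomes, n=2))
```

On real data the inputs would be whole translated proteomes, and the thresholds would need to be chosen with the background amino acid composition of each organism in mind.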
Hatice U. Osmanbeyoglu, Madhavi Ganapathiraju
Added 12 May 2011
Updated 12 May 2011
Type Journal
Year 2011
Where BMCBI