We consider the compressibility of protein sequences. Having observed genome-scale long-range correlations in concatenated protein sequences from different organisms, we propose a method that exploits this unusual redundancy to compress the sequences. The result is a significant reduction in the number of bits required to represent them: we report 2.27, 2.55, 3.11, and 3.44 bits per symbol (bps) for protein sequences from M. jannaschii, H. influenzae, S. cerevisiae, and H. sapiens, respectively. These are the same protein sequences used by Nevill-Manning and Witten in the "Protein is incompressible" paper [23]. The observed long-range correlations could have significant implications beyond compression and complexity analysis of protein sequences.
Donald A. Adjeroh, Fei Nan