Sciweavers

BMCBI
2007

Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assess

Background: Similarity of sequences is a key mathematical notion for Classification and Phylogenetic studies in Biology. It is currently handled primarily using alignments. However, alignment methods seem inadequate for post-genomic studies, since they do not scale well with data set size and seem to be confined to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov Complexity, and universality is its most novel and striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness are tested on various data sets yielding ...
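As a rough illustration of how a compressor can stand in for Kolmogorov complexity, the sketch below computes the three dissimilarities named in the abstract. The abstract itself does not state the formulas; the ones used here are the forms commonly given in the compression-distance literature and should be read as an assumption. The choice of zlib as the compressor, the helper names (csize, ncd, cd, ucd) and the synthetic DNA strings are illustrative only and are not taken from the paper.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed length in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Dissimilarity (assumed formula)."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)


def cd(x: bytes, y: bytes) -> float:
    """Compression Dissimilarity (assumed formula)."""
    return csize(x + y) / (csize(x) + csize(y))


def ucd(x: bytes, y: bytes) -> float:
    """Universal Compression Dissimilarity (assumed formula)."""
    cx, cy = csize(x), csize(y)
    return max(csize(x + y) - cx, csize(y + x) - cy) / max(cx, cy)


# Synthetic data, for illustration only: two unrelated random DNA strings
# and a lightly mutated copy of the first one.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(2000)).encode()
b = "".join(rng.choice("ACGT") for _ in range(2000)).encode()

mutated = bytearray(a)
for i in range(0, len(mutated), 400):
    # Substitute one base every 400 positions.
    mutated[i] = ord("C") if mutated[i] == ord("A") else ord("A")
a2 = bytes(mutated)

for name, f in (("NCD", ncd), ("CD", cd), ("UCD", ucd)):
    print(name, "near copy:", round(f(a, a2), 3), "unrelated:", round(f(a, b), 3))
```

With any of the three measures, the near copy should score as less dissimilar to the original than the unrelated string does. The absolute values depend heavily on the compressor, which is consistent with the abstract's point that USM is a methodology rather than a single formula.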
Added 09 Dec 2010
Updated 09 Dec 2010
Type Journal
Year 2007
Where BMCBI
Authors Paolo Ferragina, Raffaele Giancarlo, Valentina Greco, Giovanni Manzini, Gabriel Valiente