Finding recurrent sources in sequences

16 years 2 months ago

Download www.cs.helsinki.fi

Many genomic sequences and, more generally, (multivariate) time series display tremendous variability. However, often it is reasonable to assume that the sequence is actually generated by or assembled from a small number of sources, each of which might contribute several segments to the sequence. That is, there are h hidden sources such that the sequence can be written as a concatenation of k > h pieces, each of which stems from one of the h sources. We define this (k, h)segmentation problem and show that it is NP-hard in the general case. We give approximation algorithms achieving approximation ratios of 3 for the L1 error measure and 5 for the L2 error measure, and generalize the results to higher dimensions. We give empirical results on real (chromosome 22) and artificial data showing that the methods work well in practice. Categories and Subject Descriptors F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems--Computations on discrete str...

Aristides Gionis, Heikki Mannila

Real-time Traffic