Background: Single nucleotide polymorphisms (SNPs) are locations at which the genomic sequences of population members differ. Since these differences are known to follow patterns, disease association studies are facilitated by identifying SNPs that allow the unique identification of such patterns. This process, known as haplotype tagging, is formulated as a combinatorial optimization problem and analyzed in terms of complexity and approximation properties.

Results: It is shown that the tagging problem is NP-hard and approximable within $1 + \ln((n^2 - n)/2)$ for $n$ haplotypes, but not approximable within $(1 - \epsilon)\ln(n/2)$ for any $\epsilon > 0$ unless $\mathrm{NP} \subset \mathrm{DTIME}(n^{\log\log n})$. A simple, very easily implementable algorithm that exhibits the above upper bound on solution quality is presented. This algorithm has running time $O\left(\frac{np}{2}(2m - p + 1)\right) \leq O(m(n^2 - n)/2)$, where $p \leq \min(n, m)$, for $n$ haplotypes of size $m$. As we show that the approximation bound is asymptotically tight, the algorithm presented is optimal with...
Staal A. Vinterbo, Stephan Dreiseitl, Lucila Ohno-
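
The abstract does not spell out the algorithm, so the following is only a minimal sketch under stated assumptions: it treats tagging as a set cover over the at most $(n^2 - n)/2$ pairs of distinct haplotypes and applies the standard greedy set-cover heuristic, which is known to stay within $1 + \ln$ of the universe size. The function name `greedy_tag_snps` and the encoding of haplotypes as equal-length strings are hypothetical choices made for illustration, and this naive version does not aim for the running time stated in the abstract.

```python
from itertools import combinations


def greedy_tag_snps(haplotypes):
    """Greedy set-cover sketch (an assumption, not necessarily the paper's method).

    haplotypes: list of equal-length strings, one character per SNP.
    Returns a sorted list of SNP column indices such that every pair of
    distinct haplotypes differs in at least one chosen column.
    """
    if not haplotypes:
        return []
    m = len(haplotypes[0])
    if any(len(h) != m for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")

    # Duplicate haplotypes cannot be separated, so only distinct ones matter.
    distinct = sorted(set(haplotypes))

    # Universe to cover: all unordered pairs of distinct haplotypes.
    uncovered = set(combinations(range(len(distinct)), 2))

    # covers[k] holds the pairs that SNP column k tells apart.
    covers = [
        {(i, j) for (i, j) in uncovered if distinct[i][k] != distinct[j][k]}
        for k in range(m)
    ]

    chosen = []
    while uncovered:
        # Pick the column separating the most still-unseparated pairs.
        # Distinct haplotypes differ somewhere, so the gain is at least 1.
        best = max(range(m), key=lambda k: len(covers[k] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return sorted(chosen)
```

For the four haplotypes `["000", "011", "101", "110"]`, each column separates four of the six pairs; the sketch selects columns 0 and 1, whose projections `00`, `01`, `10`, `11` are pairwise distinct.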