Clustering of EST data is a method for the non-redundant representation of an organism's transcriptome. During clustering of large amounts of EST data, some large clusters (>500 sequences) are usually created. These can lead to iterative contig builds, consumption of large amounts of computing time, and improbable exon alignments, which is unfavourable. In addition, such clusters sometimes contain transcripts of more than one gene, which is not desired. Large clusters of this kind arise due to: (1) large numbers of identical ESTs / high transcript levels; (2) large gene families with highly similar members; (3) false clustering due to a) unremoved vector or rRNA sequences, b) undetected cloning artifacts or c) repetitive elements in UTRs. During pre-processing (filtering and masking) of the raw sequence data, contaminations such as vector or linker sequences as well as bacterial genes are removed (clipping). In the same process, it is essential to mask repetitive elements in ord...
Stefan A. Rensing, Daniel Lang, Ralf Reski