Low correlation between mRNA concentrations measured at different locations for the same exon shows that many current Ensembl exon definitions are incomplete. Automatically created patterns (e.g. TCTTT) in genic DNA sequences identify potential new alternative transcripts. Strongly typed grammar-based genetic programming (GP) is used to evolve regular expressions (RE) that separate gene exons with potential alternative mRNA expression from those without. RNAnet gives us correlations between Affymetrix HG-U133 Plus 2 GeneChip probe measurements for the same exon across 2757 Homo sapiens tissue samples from NCBI’s GEO database. We identify many non-atomic Ensembl exons, i.e. exons with substructure. Biological patterns can be data mined using a Backus-Naur form (BNF) context-free grammar and a strongly typed GP written in gawk and using egrep. The automatically produced DNA motifs suggest that alternative polyadenylation is not responsible. (Short version in [19].) The training data is avai...
William B. Langdon, Joanna Rowsell, Andrew P. Harr
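
The abstract describes using a regular expression as a classifier: an exon is flagged as potentially having substructure when the pattern matches its DNA sequence, much as egrep would select matching lines. The following is a minimal Python sketch of that idea only. It assumes a fixed motif (TCTTT, the example quoted above) rather than an evolved one, and the function names and toy sequences are hypothetical illustrations, not the authors' gawk implementation or data.

```python
import re

# Assumption: a single fixed motif stands in for an evolved regular expression.
# TCTTT is the example pattern quoted in the abstract.
MOTIF = re.compile(r"TCTTT")


def predict_substructure(exon_seq: str) -> bool:
    """Return True when the motif occurs anywhere in the exon's DNA sequence."""
    return MOTIF.search(exon_seq.upper()) is not None


def accuracy(examples) -> float:
    """Fraction of (sequence, label) pairs that the motif classifies correctly."""
    correct = sum(predict_substructure(seq) == label for seq, label in examples)
    return correct / len(examples)


# Made-up sequences and labels, for illustration only.
toy = [
    ("ACGTCTTTGA", True),
    ("ACGTACGTAC", False),
    ("ggtctttcca", True),
    ("TTTTCCCCGG", True),
]

if __name__ == "__main__":
    print(accuracy(toy))
```

In a GP setting, a score such as `accuracy` could serve as the fitness of each candidate regular expression, with the grammar restricting which expressions can be generated.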