Improving Rule Induction Precision for Automated Annotation by Balancing Skewed Data Sets



Submissions to genomic databases are increasing at an overwhelming rate, which poses a problem for database maintenance, especially the annotation of fields left blank during submission. To avoid including all data exactly as submitted, one alternative is to perform the annotation manually. A less resource-demanding alternative is automatic annotation. The latter helps the curator, since manually predicting the properties of each protein sequence is becoming a bottleneck, at least for protein databases. Machine Learning (ML) techniques have been used to generate automatic annotation and to help curators. A challenging problem for automatic annotation is that traditional ML algorithms assume a balanced training set. However, real-world data sets are predominantly imbalanced (skewed), i.e., there is a large number of examples of one class compared with just a few examples of the other class. This is the case for protein databases where a large number of proteins is not a...
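To make the notion of a skewed training set concrete, the following is a minimal sketch of random over-sampling, one common way to balance classes. It is an illustration only and is not taken from the paper: the function name `random_oversample`, the class labels `majority` and `minority`, and the 95/5 split are all assumptions chosen for the example, and the abstract above does not say which balancing method the authors use.

```python
import random
from collections import Counter


def random_oversample(examples, labels, seed=0):
    """Hypothetical helper: duplicate randomly chosen examples of the
    smaller classes until every class is as large as the largest one."""
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)

    # Every class is grown to the size of the largest class.
    target = max(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Sample with replacement to fill the gap (empty for the largest class).
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy skewed data set (assumed for illustration): 95 vs. 5 examples.
examples = [f"seq{i}" for i in range(100)]
labels = ["majority"] * 95 + ["minority"] * 5

balanced_x, balanced_y = random_oversample(examples, labels)
print(Counter(labels))      # class counts before balancing
print(Counter(balanced_y))  # class counts after balancing
```

Over-sampling keeps every original example but repeats minority examples, so a learner may overfit to them; under-sampling the majority class is the usual alternative and trades that risk for discarded data.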

Gustavo E. A. P. A. Batista, Maria Carolina Monard


Automatic Annotation | Data Sets | Informatics | KELSI 2004 | Protein Databases


Post Info

Added	02 Jul 2010
Updated	02 Jul 2010
Type	Conference
Year	2004
Where	KELSI
Authors	Gustavo E. A. P. A. Batista, Maria Carolina Monard, Ana L. C. Bazzan
