A data reduction approach for resolving the imbalanced data issue in functional genomics

15 years 6 months ago


Imbalanced data arise frequently in machine learning applications. In scientific applications, a ratio of one positive example to thousands of negative instances is common. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular approach to this difficulty is to resample the training data. However, this results in many false positive predictions. Hence, we propose preprocessing the training data by partitioning them into clusters. This greatly reduces the imbalance between minority and majority instances in each cluster. For moderate imbalance ratios, our technique gives better prediction accuracy than other resampling methods. For extreme imbalance ratios, this technique serves as a good filter that reduces the amount of imbalance so that traditional classification techniques can be deployed. More importantly, we have successfully applied our techniques to splice site prediction and protein subcellular localization problems, with si...
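The claim that partitioning reduces per-cluster imbalance can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which clustering algorithm is used, so the fixed centroids, the toy data, and the helper names `assign_clusters` and `imbalance_per_cluster` are all assumptions made for illustration.

```python
def assign_clusters(points, centroids):
    """Assign each point to the index of its nearest centroid (squared Euclidean distance)."""
    def nearest(p):
        return min(
            range(len(centroids)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
        )
    return [nearest(p) for p in points]


def imbalance_per_cluster(labels, cluster_ids):
    """Return {cluster: (n_negative, n_positive)} for binary labels (0 = negative, 1 = positive)."""
    counts = {}
    for y, c in zip(labels, cluster_ids):
        neg, pos = counts.get(c, (0, 0))
        counts[c] = (neg + (y == 0), pos + (y == 1))
    return counts


# Hypothetical toy data: 2 positives and 4 negatives near the origin,
# plus 14 negatives far away. Overall ratio is 18 negatives to 2 positives (9:1).
points = (
    [(0.0, 0.0), (0.2, 0.1)]
    + [(0.1 * i, 0.3) for i in range(4)]
    + [(10.0 + 0.1 * i, 10.0) for i in range(14)]
)
labels = [1, 1] + [0] * 18

# Assumed centroids; a real pipeline would learn these with a clustering algorithm.
centroids = [(0.0, 0.0), (10.0, 10.0)]

clusters = assign_clusters(points, centroids)
counts = imbalance_per_cluster(labels, clusters)

for c, (neg, pos) in sorted(counts.items()):
    print(f"cluster {c}: {neg} negatives, {pos} positives")
```

In this toy case the cluster containing the positives has a 2:1 ratio instead of the overall 9:1, while the other cluster contains only negatives. How clusters without minority instances are handled is not stated in the excerpt above.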

Kihoon Yoon, Stephen Kwek
