Sciweavers

IBPRIA
2007
Springer

Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues

Random forest is an ensemble of decision trees and a popular ensemble technique in pattern recognition. In this article, we apply random forest to gene expression based cancer classification and address two issues that have so far been overlooked in other works. First, we demonstrate on two real-world datasets that the performance of random forest is strongly influenced by dataset complexity. When estimated before running random forest, this complexity can serve as a useful performance indicator and can explain differences in performance across datasets. Second, we show that feature importance should be used with caution when ranking genes: two forests generated with different numbers of features per node split may have very similar classification errors on the same dataset, yet the respective lists of genes ranked by feature importance can be only weakly correlated.
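
As a rough illustration of the second point, the agreement between two gene rankings can be quantified with a rank correlation. The sketch below assumes Spearman's coefficient, which the abstract does not name, and uses made-up importance scores for five hypothetical genes; the function names `ranks` and `spearman` and the sample values are illustrative only and are not taken from the paper.

```python
def ranks(scores):
    """Rank scores in descending order (1 = most important).

    Tied scores share the average of the ranks they occupy.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical importance scores for the same five genes from two forests
# (illustrative values only, not results from the paper).
importance_a = [0.30, 0.25, 0.20, 0.15, 0.10]
importance_b = [0.10, 0.30, 0.15, 0.25, 0.20]

rho = spearman(importance_a, importance_b)
print(f"Spearman correlation between the two gene rankings: {rho:.2f}")
```

A coefficient near 1 would mean the two forests rank the genes almost identically, while a value near 0 would indicate weakly related rankings even if the forests classify equally well.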
Oleg Okun, Helen Priisalu
Added 16 Aug 2010
Updated 16 Aug 2010
Type Conference
Year 2007
Where IBPRIA
Authors Oleg Okun, Helen Priisalu