For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often give very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear...
Seungchan Kim, Edward R. Dougherty, Junior Barrera
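
The idea of measuring error as a function of the spread can be illustrated with a minimal Monte Carlo sketch. It is not the authors' algorithm: it assumes an isotropic Gaussian spreading distribution with standard deviation sigma, approximates the spread distribution by drawing noisy copies of each sample point, and uses a least-squares linear classifier as a stand-in. All function names and the toy data are hypothetical.

```python
import numpy as np


def spread_sample(X, y, sigma, copies, rng):
    """Approximate the spread distribution: draw `copies` Gaussian-perturbed
    versions of every sample point (assumed isotropic kernel, std = sigma)."""
    noise = rng.normal(0.0, sigma, size=(copies,) + X.shape)
    Xs = (X[None, :, :] + noise).reshape(-1, X.shape[1])
    ys = np.tile(y, copies)  # labels repeat in the same order as the copies
    return Xs, ys


def fit_linear(X, y):
    """Least-squares linear classifier (an assumed stand-in); labels in {0, 1}."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w


def predict(w, X):
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    return (A @ w > 0).astype(int)


def spread_error(X, y, sigma, copies=200, seed=0):
    """Design the classifier on the spread sample and return its error there."""
    rng = np.random.default_rng(seed)
    Xs, ys = spread_sample(X, y, sigma, copies, rng)
    w = fit_linear(Xs, ys)
    return float(np.mean(predict(w, Xs) != ys))


# Hypothetical toy sample: two features, three points per class.
X = np.array([[-2.0, 0.0], [-2.5, 0.5], [-1.5, -0.5],
              [2.0, 0.0], [2.5, -0.5], [1.5, 0.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Error of the feature set as a function of the spread.
for sigma in (0.0, 1.0, 2.0, 3.0):
    print(f"sigma={sigma:.1f}  error={spread_error(X, y, sigma):.3f}")
```

In this sketch, a feature set whose error stays low as sigma grows would count as strong in the sense described in the abstract; comparing the resulting error curves across candidate feature sets is one way such a ranking could be made.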