Feature selection is fundamental to knowledge discovery from massive amounts of high-dimensional data. In an effort to establish theoretical justification for feature selection algorithms, this paper presents a theoretically optimal criterion, namely the discriminative optimal criterion (DoC), for feature selection. Compared with the existing representative optimal criterion (RoC; Koller and Sahami, 1996), which maximizes the information preserved for modeling the relationship between input and output variables, DoC is pragmatically advantageous because it directly maximizes classification accuracy and naturally reflects the Bayes error in its objective. To make DoC computationally tractable for practical tasks, we propose an algorithmic framework that selects a subset of features by minimizing the nonparametric Bayes error. A set of existing algorithms, as well as new ones, can be derived naturally from this framework. As an example, we show that the Relief algorithm greedily ...
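To make the framework's connection to Relief concrete, the following is a minimal sketch of the classic Relief weighting scheme (Kira and Rendell, 1992), which the abstract casts as a greedy instance of the proposed framework. All function and variable names here are illustrative assumptions, not the paper's implementation; the sketch assumes a two-class problem with features on comparable scales.

```python
import numpy as np

def relief_weights(X, y, n_samples=None, seed=None):
    """Estimate per-feature relevance weights with the two-class Relief rule.

    X : (n, d) array of feature vectors, assumed scaled to comparable ranges.
    y : (n,) array of binary class labels.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_samples = n if n_samples is None else n_samples
    w = np.zeros(d)
    for i in rng.choice(n, size=n_samples, replace=False):
        dists = np.abs(X - X[i]).sum(axis=1)   # L1 distances to every point
        dists[i] = np.inf                      # exclude the point itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest same-class point
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest other-class point
        # Reward features that separate the classes, penalize those that do not.
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_samples

# Toy usage: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(0, 1, (50, 2)) + [3, 0]])
y = np.array([0] * 50 + [1] * 50)
print(relief_weights(X, y, seed=0))  # weight for feature 0 should dominate
```

Under the paper's reading, each margin-style update above can be seen as a greedy, one-feature-at-a-time reduction of a nonparametric estimate of the Bayes error, which is what ties Relief to the DoC framework.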