Existing feature extraction methods exploit either the global statistical information or the local geometric information underlying the data. In this paper, we propose a general framework for learning features that account for both types of information, based on variational optimization of nonparametric learning criteria. Using mutual information and the Bayes error rate as example criteria, we show that high-quality features can be learned by a variational graph embedding procedure. This procedure is solved by an iterative EM-style algorithm in which the E-step learns a variational affinity graph and the M-step in turn embeds this graph by spectral analysis. The resulting feature learner has several appealing properties, including maximum discrimination, maximum relevance with minimum redundancy, and locality preservation. Experiments on benchmark face recognition data sets confirm the effectiveness of the proposed algorithms.
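
The alternation between graph learning and spectral embedding can be illustrated with a minimal Python sketch. It is not the algorithm derived in the paper: the signed Gaussian affinity used in `e_step`, the orthonormal linear projection used in `m_step`, and all names and parameters (`fit_features`, `sigma`, `n_iter`) are assumptions chosen only to show the loop structure. The variational updates that follow from mutual information or the Bayes error rate are not reproduced here.

```python
import numpy as np


def e_step(Z, y, sigma=1.0):
    """Assumed E-step: build a signed affinity graph from current features Z (n x d).

    Same-class pairs get a positive Gaussian weight, different-class pairs a
    negative one. This is an illustrative stand-in for a variational graph.
    """
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T), 0.0)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    same = (y[:, None] == y[None, :]).astype(float)
    W = K * (2.0 * same - 1.0)      # +K within a class, -K between classes
    W = 0.5 * (W + W.T)             # remove floating-point asymmetry
    np.fill_diagonal(W, 0.0)        # no self-loops
    return W


def m_step(X, W, dim):
    """Assumed M-step: embed the graph W with a linear projection A (D x dim).

    Minimizes sum_ij W_ij * ||A^T x_i - A^T x_j||^2 subject to A^T A = I,
    which reduces to the smallest eigenvectors of X^T L X with L = D - W.
    """
    L = np.diag(W.sum(axis=1)) - W
    M = X.T @ L @ X
    M = 0.5 * (M + M.T)             # symmetrize before the eigensolver
    _, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return vecs[:, :dim]


def fit_features(X, y, dim=2, n_iter=10, sigma=1.0):
    """Alternate the two steps for a fixed number of iterations."""
    Xc = X - X.mean(axis=0)
    # Initialize the projection with the leading principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = Vt[:dim].T
    W = None
    for _ in range(n_iter):
        W = e_step(Xc @ A, y, sigma)    # graph from the current features
        A = m_step(Xc, W, dim)          # features from the current graph
    return A, W
```

Calling `fit_features(X, y, dim=2)` on a data matrix `X` of shape `(n, D)` with integer labels `y` returns a projection matrix with orthonormal columns and the final affinity graph. A fixed iteration count is used here for simplicity; a convergence check on the learning criterion would be the natural replacement.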