Many machine learning algorithms can be formulated as the minimization of a training criterion which involves (1) "training errors" on each training example and (2) some hyper-parameters, which are kept fixed during this minimization. When there is only a single hyper-parameter one can easily explore how its value affects a model selection criterion (that is not the same as the training criterion, and is used to select hyper-parameters). In this paper we present a methodology to select many hyper-parameters that is based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters. We first consider the case of a training criterion that is quadratic in the parameters. In that case, the gradient of the selection criterion with respect to the hyper-parameters is efficiently computed by back-propagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyper-parameter gradient.
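To make the quadratic case concrete, the following is a minimal sketch (not the paper's implementation) using ridge regression, where the single hyper-parameter is the weight-decay coefficient: the training criterion is quadratic in the parameters, its minimizer is obtained through a Cholesky-based solve, and the gradient of a held-out selection criterion with respect to the hyper-parameter is obtained by back-propagating through that solve. The use of JAX automatic differentiation and all variable names are illustrative assumptions.

```python
# Illustrative sketch: gradient of a validation (model selection) criterion
# with respect to a weight-decay hyper-parameter, differentiating through a
# Cholesky decomposition. JAX autodiff stands in for a hand-derived backward pass.
import jax
import jax.numpy as jnp

def ridge_weights(lam, X, y):
    # Training criterion ||Xw - y||^2 + lam ||w||^2 is quadratic in w,
    # so the minimizer solves (X^T X + lam I) w = X^T y, done here via Cholesky.
    A = X.T @ X + lam * jnp.eye(X.shape[1])
    L = jnp.linalg.cholesky(A)
    return jax.scipy.linalg.cho_solve((L, True), X.T @ y)

def selection_criterion(lam, X_tr, y_tr, X_va, y_va):
    # Model selection criterion: squared error on held-out data,
    # evaluated at the trained parameters w*(lam).
    w = ridge_weights(lam, X_tr, y_tr)
    r = X_va @ w - y_va
    return jnp.mean(r ** 2)

# Hyper-parameter gradient: back-propagate through the Cholesky-based solve.
grad_wrt_lam = jax.grad(selection_criterion, argnums=0)

# Toy data (purely for demonstration).
key = jax.random.PRNGKey(0)
X_tr = jax.random.normal(key, (50, 5))
y_tr = X_tr @ jnp.arange(1.0, 6.0) + 0.1 * jax.random.normal(key, (50,))
X_va = jax.random.normal(jax.random.PRNGKey(1), (20, 5))
y_va = X_va @ jnp.arange(1.0, 6.0)

lam = 0.1
print(grad_wrt_lam(lam, X_tr, y_tr, X_va, y_va))  # d(selection criterion)/d(lam)
```

With many hyper-parameters, the same gradient could drive an outer gradient-descent loop over them, which is the setting the paper addresses; the scalar case above is only the simplest instance.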