We study the problem of minimizing the expected loss of a linear predictor while constraining its sparsity, i.e., bounding the number of features used by the predictor. While the resulting optimization problem is generally NP-hard, several approximation algorithms are considered. We analyze the performance of these algorithms, focusing on the characterization of the trade-off between accuracy and sparsity of the learned predictor in different scenarios.