We consider the problem of extracting features for multi-class recognition. The features must support fine distinctions between similar classes while tolerating distortions and missing information. We define and compare two general approaches, both based on maximizing the information delivered for recognition: one divides the problem into multiple binary classification tasks, while the other uses a single multi-class scheme. The two strategies yield markedly different sets of features, which we apply to face identification and detection. We show that the first produces a sparse set of distinctive features that are specific to an individual face and are highly tolerant to distortions and missing input. The second produces compact features, each shared by about half of the faces, which perform better in general face detection. The results show the advantage of distinctive features for making fine distinctions in a robust manner. They also show that di...