In this paper the sparse coding principle is employed for the representation of multimodal image data, i.e. image intensity and range. We estimate an image basis for frontal face images taken with a Time-ofFlight (TOF) camera to obtain a sparse representation of facial features, such as the nose. These features are then evaluated in an object detection scenario where we estimate the position of the nose by template matching and a subsequent application of appropriate thresholds that are estimated from a labeled training set. The main contribution of this work is to show that the templates can be learned simultaneously on both intensity and range data based on the sparse coding principle, and that these multimodal templates significantly outperform templates generated by averaging over a set of aligned image patches containing the facial feature of interest as well as multimodal templates computed via Principal Component Analysis (PCA). The system achieves a detection rate of 96.4% on ...