In this paper, a high-level optimization methodology is applied to the implementation of the well-known Convolutional Face Finder (CFF) algorithm for real-time applications on cellular phones, such as teleconferencing, advanced user interfaces, picture indexing and security access control. This face detector is based on a feature extraction and classification technique that consists of a pipeline of convolution and subsampling operations. The design of embedded systems must find a good trade-off between performance and code size because of the limited resources available. We propose a methodology to cope with the main drawbacks of the original CFF implementation, such as floating-point computation and memory allocation, to allow parallelism exploitation, and to perform algorithm optimizations. Results show that our embedded face detection system can accurately locate faces with less computational load and memory cost. It runs on a 275 MHz Starcore DSP at 9 QCIF images/s with state-of-the-ar...