We present an approach for simultaneous monocular 3D face pose and facial animation tracking. The pose and the facial animation parameters are estimated from observed raw-brightness, shape-free 2D image patches. A parameterized 3D face model is adopted to crop out and shape-normalize these patches from the video frames. Starting from the face model aligned with an observed human face, we learn the relation between a set of perturbations of the model parameters and the associated image patches using Canonical Correlation Analysis (CCA). At tracking time, this learned mapping is applied to the patch observed in the current frame to estimate the correction to be added to the face pose and to the animation parameters controlling the lips, eyebrows and eyes. Ground-truth data are used to evaluate both pose and facial animation tracking on long real video sequences.
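As a minimal sketch of the CCA-based parameter update described above, the following NumPy code learns a mapping from perturbed patch observations to parameter corrections and applies it to a new observation. The linear "imaging model" (a random Jacobian `J` producing patch differences from parameter perturbations) is a hypothetical stand-in for the real rendering/warping of shape-free patches, and the dimensions and regularization constants are illustrative assumptions, not values from the paper.

```python
import numpy as np

def inv_sqrt(S, eps=1e-8):
    """Inverse matrix square root via eigendecomposition (S symmetric PD)."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def fit_cca(X, Y, k):
    """Return bases (Wx, Wy) spanning the top-k canonical directions."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(X)
    # regularized covariance blocks
    Sxx = Xc.T @ Xc / n + 1e-6 * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + 1e-6 * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    A, B = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(A @ Sxy @ B)
    return A @ U[:, :k], B @ Vt.T[:, :k]

rng = np.random.default_rng(0)
n, d_patch, d_param = 500, 100, 6
J = rng.standard_normal((d_patch, d_param))          # hypothetical Jacobian
dP = rng.uniform(-1, 1, (n, d_param))                # parameter perturbations
X = dP @ J.T + 0.05 * rng.standard_normal((n, d_patch))  # associated patch differences

# training: CCA between patches and perturbations, then a linear
# regressor from the canonical projection of a patch to the correction
Wx, Wy = fit_cca(X, dP, k=d_param)
Z = (X - X.mean(0)) @ Wx
G, *_ = np.linalg.lstsq(Z, dP - dP.mean(0), rcond=None)

# tracking time: observed patch difference -> estimated parameter correction
obs = dP[:1] @ J.T
pred = (obs - X.mean(0)) @ Wx @ G + dP.mean(0)
```

Restricting the regression to the canonical subspace keeps the estimator low-rank, which is the practical motivation for CCA over a direct least-squares fit in the high-dimensional patch space.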