This paper presents a robust and accurate method for jointly tracking head pose and facial actions, even under challenging conditions such as varying lighting, large head movements, and fast motion. This is made possible by combining two types of facial features: locations sampled from the facial texture, whose appearance is initialized on the first frame and adapted over time, and illumination-invariant patches located at characteristic points of the face, such as the corners of the eyes or mouth. The first type of feature carries rich information about the global appearance of the face and thus yields accurate tracking, while the second guarantees robustness and stability by preventing drift. We evaluate our system on the Boston University Face Tracking benchmark and show that it outperforms state-of-the-art methods.