We present a novel method for transferring speech animation recorded in low-resolution videos onto realistic 3D facial models. Unsupervised learning is applied to a speech video corpus to find the underlying manifold of facial configurations, and K-means clustering in this low-dimensional space identifies key speech-related facial shapes. With a small set of laser-scanned 3D models corresponding to the cluster centroids, the facial animation in the 2D videos is transferred onto 3D shapes. In particular, a weak perspective projection model allows the underlying mandible rotation to be recovered from the videos and used to drive the 3D skull motion. The adaptation of a generic skull to the facial models is guided by a 2D image, the Tissue Map. With parsimonious data requirements, our system accomplishes the animation transfer and achieves realistic rendering supported by the underlying anatomical structure.
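To make the clustering step concrete, the sketch below embeds per-frame facial landmarks into a low-dimensional space and picks one representative frame per cluster. The abstract does not specify the unsupervised technique, so PCA stands in for the manifold learning here; the array shapes, component count, and cluster count are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch: PCA embedding (a stand-in for the unsupervised manifold
# learning) followed by K-means to select key speech-related facial shapes.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# frames: (n_frames, n_landmarks * 2) flattened 2D landmark coordinates
# tracked from the speech video corpus (hypothetical placeholder data).
rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 2 * 68))

pca = PCA(n_components=10)           # low-dimensional facial configuration space
embedded = pca.fit_transform(frames)

kmeans = KMeans(n_clusters=16, n_init=10, random_state=0)
labels = kmeans.fit_predict(embedded)

# For each cluster, the frame nearest the centroid is a candidate "key
# facial shape" to be captured as a laser-scanned 3D model.
key_frames = []
for c in range(kmeans.n_clusters):
    members = np.flatnonzero(labels == c)
    dists = np.linalg.norm(embedded[members] - kmeans.cluster_centers_[c], axis=1)
    key_frames.append(members[np.argmin(dists)])
print(sorted(key_frames))
```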
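The mandible-rotation recovery can likewise be illustrated under a weak perspective (scaled orthographic) camera: rotate rest-pose jaw points about a fixed condylar axis, project, and pick the angle that best matches the observed 2D landmarks. The condylar axis, rest-pose points, camera scale, and brute-force angle scan below are all illustrative assumptions, not the paper's calibration or optimization procedure.

```python
# Minimal sketch: recover a 1-DOF jaw-opening angle from 2D landmarks
# under a weak perspective projection model.
import numpy as np

def rot_x(theta):
    """Rotation about the x-axis (assumed condylar axis)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0, 0],
                     [0, c, -s],
                     [0, s,  c]])

def project(points3d, scale, t2d):
    """Weak perspective: uniform scale, drop depth, add 2D translation."""
    return scale * points3d[:, :2] + t2d

def recover_jaw_angle(jaw_rest, obs2d, scale, t2d):
    """Scan plausible jaw-opening angles and return the one minimizing
    reprojection error of the rotated, projected rest-pose jaw points."""
    thetas = np.radians(np.linspace(0.0, 35.0, 351))
    errs = [np.linalg.norm(project(jaw_rest @ rot_x(t).T, scale, t2d) - obs2d)
            for t in thetas]
    return thetas[int(np.argmin(errs))]

# Synthetic check: generate observations at 18 degrees and recover them.
jaw_rest = np.array([[  0.0, -40.0, 20.0],   # chin tip, relative to axis
                     [ 15.0, -30.0, 25.0],   # jawline points (hypothetical)
                     [-15.0, -30.0, 25.0]])
t2d = np.array([120.0, 200.0])
obs = project(jaw_rest @ rot_x(np.radians(18.0)).T, scale=3.0, t2d=t2d)
print(np.degrees(recover_jaw_angle(jaw_rest, obs, 3.0, t2d)))  # ~18.0
```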