There has been a flurry of work on video-based face recognition in recent years. One of the hard problems in this area is how to effectively combine facial configuration and temporal dynamics for the recognition task. The proposed method treats this problem in two steps. We first construct several view-specific appearance submanifolds, learned from the training video frames using locally linear embedding (LLE). A general Bayesian inference model is then fit to the recognition task, reducing the complicated maximum-likelihood estimation to elegant distance measures in the learned submanifolds. Experimental results on a medium-scale video database demonstrate the effectiveness and flexibility of the proposed method.
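The two-step pipeline can be illustrated with a minimal sketch. This is not the paper's implementation: it uses synthetic feature vectors in place of video frames, scikit-learn's `LocallyLinearEmbedding` for the per-view submanifolds, and a plain nearest-neighbor distance in the embedded space as a stand-in for the Bayesian maximum-likelihood decision. All names (`recognize`, the view labels, the data shapes) are illustrative assumptions.

```python
# Sketch: one LLE appearance submanifold per view, then distance-based
# recognition in the embedded space (surrogate for the ML decision).
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic "frames": two subjects x two views, 50-dim feature vectors,
# 30 frames per (subject, view). Real input would be cropped face frames.
views = ["frontal", "profile"]
train = {v: {} for v in views}
for v_i, v in enumerate(views):
    for subj in (0, 1):
        center = np.zeros(50)
        center[subj] = 5.0          # subject-specific direction
        center[10 + v_i] = 5.0      # view-specific direction
        train[v][subj] = center + 0.1 * rng.standard_normal((30, 50))

# Step 1: fit one view-specific LLE submanifold on that view's frames.
models, embedded, labels = {}, {}, {}
for v in views:
    X = np.vstack([train[v][s] for s in (0, 1)])
    lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2)
    models[v] = lle
    embedded[v] = lle.fit_transform(X)
    labels[v] = np.repeat([0, 1], 30)

# Step 2: embed a probe frame into its view's submanifold and label it
# by the nearest embedded training frame.
def recognize(frame, view):
    z = models[view].transform(frame[None, :])
    nn = NearestNeighbors(n_neighbors=1).fit(embedded[view])
    _, idx = nn.kneighbors(z)
    return int(labels[view][idx[0, 0]])

probe = train["frontal"][1][0] + 0.05 * rng.standard_normal(50)
print(recognize(probe, "frontal"))
```

In the paper the distance computation stands in for likelihood evaluation under the Bayesian model; the sketch collapses that to Euclidean nearest-neighbor purely for illustration.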