We consider the problem of unsupervised classification of temporal sequences of facial expressions in video. This problem arises in the design of an adaptive visual agent, which must be capable of identifying appropriate classes of visual events without supervision to effectively complete its tasks. We present a multilevel dynamic Bayesian network that learns the high-level dynamics of facial expressions simultaneously with models of the expressions themselves. We show how the parameters of the model can be learned in a scalable and efficient way. We present preliminary results using real video data and a class of simulated dynamic event models. The results show that our model correctly classifies the input data comparably to a standard event classification approach, while also learning the high-level model parameters.