Many recent works have attempted to improve object recognition by exploiting temporal dynamics, an intrinsic property of video sequences. In this paper, a new spatiotemporal hierarchical agglomerative clustering (STHAC) method is proposed for the automatic extraction of face exemplars for face recognition in video sequences. Two variants of STHAC are presented: a global variant that unifies the spatial and temporal distances between points, and a local variant that perturbs distances according to a local spatiotemporal neighborhood criterion. The faces nearest to the cluster means are chosen as exemplars for the testing stage, where subjects in the test video sequences are recognized using a probabilistic classifier. Extensive evaluation on a face video database demonstrates the effectiveness of the proposed method and the significance of incorporating temporal information for exemplar extraction.
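As a rough illustration of the global variant and the exemplar selection step, the following minimal Python sketch clusters a short sequence of face feature vectors using a fused spatiotemporal distance and then picks the face nearest to each cluster mean. It rests on several assumptions that are not taken from the paper: the fusion rule (the Euclidean norm of the normalized feature distance and the normalized frame-index distance, weighted by a hypothetical parameter `alpha`), the use of average linkage, and the function names `fused_distances`, `agglomerate`, and `pick_exemplars`. The local variant and the classifier are not covered.

```python
import math


def fused_distances(features, alpha=1.0):
    """Pairwise distances mixing feature-space and frame-index distance.

    ASSUMPTION: this fusion rule is illustrative only; the abstract does
    not specify how spatial and temporal distances are unified.
    """
    n = len(features)
    spatial = [[math.dist(features[i], features[j]) for j in range(n)]
               for i in range(n)]
    temporal = [[abs(i - j) for j in range(n)] for i in range(n)]
    # Normalize both distances to [0, 1] so that alpha is a relative weight.
    s_max = max(max(row) for row in spatial) or 1.0
    t_max = max(max(row) for row in temporal) or 1.0
    return [[math.hypot(spatial[i][j] / s_max, alpha * temporal[i][j] / t_max)
             for j in range(n)] for i in range(n)]


def agglomerate(dist, n_clusters):
    """Average-linkage agglomerative clustering down to n_clusters groups.

    ASSUMPTION: average linkage is chosen for simplicity.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                link = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pick_exemplars(features, clusters):
    """Return, for each cluster, the index of the face nearest to its mean."""
    dim = len(features[0])
    exemplars = []
    for members in clusters:
        mean = [sum(features[i][d] for i in members) / len(members)
                for d in range(dim)]
        exemplars.append(min(members, key=lambda i: math.dist(features[i], mean)))
    return exemplars


# Toy sequence: six frames with 2-D features forming two temporally
# contiguous groups (hypothetical data, not from the paper).
frames = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
clusters = agglomerate(fused_distances(frames, alpha=0.5), n_clusters=2)
exemplars = pick_exemplars(frames, clusters)
print(clusters, exemplars)
```

In this sketch, setting `alpha` to 0 reduces the procedure to purely spatial clustering, which gives a simple way to compare exemplars chosen with and without temporal information.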