Automatic facial action unit (AU) detection from video is a long-standing problem in computer vision. Two main approaches have been pursued: (1) static modeling, typically posed as a discriminative classification problem in which each video frame is evaluated independently; and (2) temporal modeling, in which frames are segmented into sequences and typically modeled with a variant of dynamic Bayesian networks. We propose a segment-based approach, kSeg-SVM, that combines the benefits of both approaches while avoiding their limitations. kSeg-SVM is a temporal extension of the spatial bag-of-words (BoW) approach, trained within a structured output SVM framework that formulates AU detection as a problem of detecting temporal events in a time series of visual features. Each segment is modeled by a variant of the BoW representation with soft assignment of the words based on similarity. Our framework has several benefits for AU detection: (1) both dependencies between features and the length of action units are...
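
To make the segment representation concrete, the following is a minimal sketch of a soft-assignment BoW histogram for one temporal segment. It is an illustration under stated assumptions, not the exact formulation used by kSeg-SVM: the Gaussian similarity kernel, the bandwidth `sigma`, and the function name `soft_bow_histogram` are hypothetical choices, and the codebook is assumed to have been learned beforehand (for example by clustering frame features).

```python
import math


def soft_bow_histogram(frames, codebook, sigma=1.0):
    """Soft-assignment bag-of-words histogram for one temporal segment.

    frames:   list of per-frame feature vectors (lists of floats)
    codebook: list of codeword vectors with the same dimension
    sigma:    Gaussian kernel bandwidth (hypothetical parameter)
    """
    hist = [0.0] * len(codebook)
    for x in frames:
        # Squared Euclidean distance from this frame to every codeword.
        sq = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in codebook]
        # Shift by the minimum distance for numerical stability; the
        # normalized weights are unchanged by this shift.
        m = min(sq)
        w = [math.exp(-(d - m) / (2.0 * sigma ** 2)) for d in sq]
        total = sum(w)
        # Each frame spreads one unit of mass over the codewords.
        for k, wk in enumerate(w):
            hist[k] += wk / total
    n = len(frames)
    # Average over frames so segments of different length are comparable.
    return [h / n for h in hist] if n else hist


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [10.0, 10.0]]
    segment = [[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]]
    print(soft_bow_histogram(segment, codebook))
```

In contrast to hard assignment, where each frame votes for a single nearest codeword, every frame here contributes to all codewords in proportion to its similarity, so frames lying between codewords are not forced into one bin.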