We propose a video event analysis framework based on object segmentation and tracking, combined with a Hidden Semi-Markov Model (HSMM) that explicitly models state occupancy durations. The observations generated by a multi-object detector and tracker serve as emitted symbols, and their emission probabilities are computed using multivariate Gaussians. We then recognize events by estimating the most likely object state sequence with an HSMM decoding strategy based on the Viterbi algorithm. Moreover, the duration distribution enforces a state transition after a certain time and hence better models events that are constrained to time intervals. We demonstrate and evaluate the proposed framework on a dataset of approximately 20K frames and show that duration modeling improves event detection results by 7% to 11% compared to state-of-the-art HMMs.
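
As an illustration of the decoding step, the following is a minimal sketch of Viterbi decoding for an explicit-duration HSMM with multivariate Gaussian emissions. It is not the implementation evaluated in this work. The function names (`gaussian_logpdf`, `hsmm_viterbi`), the tabulated duration distribution with a maximum duration, the absence of self-transitions, and the toy parameters are all assumptions made for the example.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian evaluated at each row of x."""
    dim = mean.shape[0]
    diff = x - mean
    solved = np.linalg.solve(cov, diff.T).T          # cov^{-1} (x - mean)
    maha = np.sum(diff * solved, axis=1)             # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha)


def hsmm_viterbi(obs, log_pi, log_A, log_dur, means, covs):
    """Most likely state sequence under an explicit-duration HSMM (assumed form).

    obs     : (T, dim) observation vectors, e.g. tracker features
    log_pi  : (N,) log initial state probabilities
    log_A   : (N, N) log transition matrix; the diagonal is -inf because
              staying in a state is handled by the duration model
    log_dur : (N, D) log_dur[j, d - 1] = log P(duration = d | state j)
    """
    T = obs.shape[0]
    N, D = log_dur.shape
    # Per-frame emission log-likelihoods, shape (T, N).
    log_b = np.stack(
        [gaussian_logpdf(obs, means[j], covs[j]) for j in range(N)], axis=1
    )
    # Prefix sums make the emission score of any segment an O(1) lookup.
    cum = np.vstack([np.zeros((1, N)), np.cumsum(log_b, axis=0)])

    # delta[t, j]: best log-score of obs[:t] whose last segment is state j.
    delta = np.full((T + 1, N), -np.inf)
    back = np.zeros((T + 1, N, 2), dtype=int)        # (previous state, duration)

    for t in range(1, T + 1):
        for j in range(N):
            for d in range(1, min(D, t) + 1):
                seg = cum[t, j] - cum[t - d, j] + log_dur[j, d - 1]
                if t == d:                           # first segment of the sequence
                    score, prev = log_pi[j] + seg, -1
                else:                                # segment follows another state
                    cand = delta[t - d] + log_A[:, j]
                    prev = int(np.argmax(cand))
                    score = cand[prev] + seg
                if score > delta[t, j]:
                    delta[t, j] = score
                    back[t, j] = (prev, d)

    # Backtrack segment by segment from the best final state.
    states = np.empty(T, dtype=int)
    t, j = T, int(np.argmax(delta[T]))
    while t > 0:
        prev, d = back[t, j]
        states[t - d:t] = j
        t, j = t - d, prev
    return states


# Toy example: two states with well-separated 1-D Gaussian emissions.
obs = np.array([[0.0], [0.0], [0.0], [5.0], [5.0], [5.0], [0.0], [0.0]])
means = [np.array([0.0]), np.array([5.0])]
covs = [np.eye(1), np.eye(1)]
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.array([[-np.inf, 0.0], [0.0, -np.inf]])   # always switch state
log_dur = np.log(np.full((2, 4), 0.25))              # uniform over durations 1..4

states = hsmm_viterbi(obs, log_pi, log_A, log_dur, means, covs)
print(states.tolist())
```

On this toy input the decoder should recover three segments (state 0, then state 1, then state 0). Because `log_dur` assigns no probability beyond the maximum tabulated duration, a state is forced to hand over after at most that many frames, which is the mechanism by which a duration model can constrain events to time intervals.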