We present a novel model for human action categorization. A video sequence is represented as a collection of spatial and spatial-temporal features by extracting static and dynamic interest points. We propose a hierarchical model that can be characterized as a constellation of bagsof-features and that is able to combine both spatial and spatial-temporal features. Given a novel video sequence, the model is able to categorize human actions in a frameby-frame basis. We test the model on a publicly available human action dataset [2] and show that our new method performs well on the classification task. We also conducted control experiments to show that the use of the proposed mixture of hierarchical models improves the classification performance over bag of feature models. An additional experiment shows that using both dynamic and static features provides a richer representation of human actions when compared to the use of a single feature type, as demonstrated by our evaluation in the cla...