We present a new method for classification with structured
latent variables. Our model is formulated within the
max-margin framework from the discriminative learning literature.
We propose an efficient learning algorithm based on
the cutting plane method and decomposed dual optimization.
We apply our model to the problem of recognizing
human actions from video sequences, where we model a human
action as a global root template and a constellation
of several “parts”. We show that our model outperforms
a similar method based on hidden conditional random
fields and performs comparably to other state-of-the-art approaches.
More importantly, the proposed framework is
general and can potentially be applied to a wide variety of
vision problems that involve complex, interdependent
latent structures.