In this paper, we present a Deformable Action Template
(DAT) model that is learnable from cluttered real-world
videos with weak supervision. In our generative model,
an action template is a sequence of image templates, each of
which consists of a set of shape and motion primitives (Gabor
wavelets and optical-flow patches) at selected orientations
and locations. The locations and orientations of these primitives
may be perturbed slightly to account for spatial
deformations. We use a shared pursuit algorithm to automatically
discover the best set of primitives and weights by
maximizing the likelihood over one or more aligned training
examples. Since it is extremely hard to accurately label
human actions in real-world videos, we use a three-step
semi-supervised learning procedure. 1) For each human
action class, a template is initialized from a labeled
(one bounding-box per frame) training video. 2) The template
is used to detect actions from other training videos of
...