Recognizing actions in realistic videos
is challenging yet compelling owing to its great potential
in many practical applications. Most previous research is
limited either by its reliance on simplified action databases recorded under
controlled environments or by its focus on excessively localized
features that do not sufficiently capture the spatiotemporal
context. In this paper, we propose to model the
spatiotemporal context information in a hierarchical way,
where three levels of context are exploited in ascending order
of abstraction: 1) point-level context (SIFT average descriptor),
2) intra-trajectory context (trajectory transition
descriptor), and 3) inter-trajectory context (trajectory proximity
descriptor). To obtain efficient and compact representations
for the latter two levels, we encode the spatiotemporal
context information into the transition matrix of
a Markov process, and then extract its stationary distribution
as the final context descriptor. Building on th...
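
As a rough illustration of the stationary-distribution step described above, the following minimal sketch row-normalizes a nonnegative matrix into a Markov transition matrix and then finds its stationary distribution by power iteration. The function names, the toy count matrix, and the uniform fallback for empty rows are assumptions made for illustration only; how the transition matrix is actually built from trajectory data is not specified in this abstract. The sketch also assumes the chain is irreducible and aperiodic, so that the iteration converges to a unique distribution.

```python
def row_normalize(counts):
    """Convert a nonnegative square matrix into a row-stochastic matrix."""
    n = len(counts)
    matrix = []
    for row in counts:
        total = sum(row)
        if total > 0:
            matrix.append([value / total for value in row])
        else:
            # Assumption: a state with no outgoing mass moves uniformly.
            matrix.append([1.0 / n] * n)
    return matrix


def stationary_distribution(P, tol=1e-12, max_iter=10000):
    """Approximate pi with pi = pi P by repeated left-multiplication."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi  # best estimate if the tolerance was not reached


# Hypothetical toy example: two states with made-up transition counts.
toy_counts = [[9, 1], [5, 5]]
toy_P = row_normalize(toy_counts)
toy_descriptor = stationary_distribution(toy_P)
```

In this toy case the resulting vector has one entry per state and sums to one, which is what makes the stationary distribution a compact, fixed-length descriptor regardless of how many transitions were observed.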