In this paper we present a new method for categorizing
video sequences capturing different scene classes. This can
be seen as a generalization of previous work on scene classification
from single images. A scene is represented by a
collection of 3D points with an appearance based codeword
attached to each point. The point cloud is recovered
by applying a robust structure-from-motion (SFM) algorithm to the
video sequence. A hierarchical structure of histograms, placed
at different locations and scales, is used
to capture the typical spatial distribution of 3D points and
codewords in the working volume. The scene is classified
by an SVM equipped with a histogram matching kernel, similar
to those of [21, 10, 16]. Results on a challenging dataset of 5
scene categories show competitive classification accuracy
and superior performance with respect to a state-of-the-art
2D pyramid matching method [16] applied to individual
image frames.
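
As an illustration only, the hierarchical histogram representation and a histogram matching kernel can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the paper: it assumes a regular subdivision of the bounding box of the point cloud into 2^l cells per axis at level l, unweighted levels, and a plain histogram intersection kernel. The names pyramid_histogram, intersection_kernel, vocab_size and levels are hypothetical.

```python
import numpy as np


def pyramid_histogram(points, codewords, vocab_size, levels=3):
    """Concatenate per-cell codeword histograms over a 3D grid at several scales.

    points:    (N, 3) array-like of 3D point coordinates
    codewords: (N,) array-like of integer codeword indices in [0, vocab_size)
    Assumption: the working volume is the bounding box of the point cloud.
    """
    points = np.asarray(points, dtype=float)
    codewords = np.asarray(codewords, dtype=int)

    # Map the point cloud into the unit cube.
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero on degenerate axes
    unit = (points - lo) / span

    parts = []
    for level in range(levels):
        cells = 2 ** level  # cells per axis at this level
        # Cell index along each axis; clip so coordinate 1.0 lands in the last cell.
        idx = np.minimum((unit * cells).astype(int), cells - 1)
        flat = (idx[:, 0] * cells + idx[:, 1]) * cells + idx[:, 2]
        # One bin per (cell, codeword) pair.
        bins = flat * vocab_size + codewords
        hist = np.bincount(bins, minlength=cells ** 3 * vocab_size)
        parts.append(hist.astype(float))
    return np.concatenate(parts)


def intersection_kernel(h1, h2):
    """Histogram intersection: sum of bin-wise minima (assumed kernel choice)."""
    return float(np.minimum(h1, h2).sum())
```

Under these assumptions, a Gram matrix of intersection_kernel values between video sequences could be passed to any SVM implementation that accepts a precomputed kernel.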