Over the years, several spatio-temporal interest point detectors have been proposed. While some detectors can only extract a sparse set of scaleinvariant features, others allow for the detection of a larger amount of features at user-defined scales. This paper presents for the first time spatio-temporal interest points that are at the same time scale-invariant (both spatially and temporally) and densely cover the video content. Moreover, as opposed to earlier work, the features can be computed efficiently. Applying scale-space theory, we show that this can be achieved by using the determinant of the Hessian as the saliency measure. Computations are speeded-up further through the use of approximative box-filter operations on an integral video structure. A quantitative evaluation and experimental results on action recognition show the strengths of the proposed detector in terms of repeatability, accuracy and speed, in comparison with previously proposed detectors.
Geert Willems, Tinne Tuytelaars, Luc J. Van Gool