We propose a framework that performs action recognition and identity maintenance of multiple targets simultaneously. Instead of first establishing tracks using an appearance model and then performing action recognition, we construct a network flow-based model that links detected bounding boxes across video frames while inferring activities, thus integrating identity maintenance and action recognition. Inference in our model reduces to a constrained minimum cost flow problem, which we solve exactly and efficiently. By leveraging both appearance similarity and action transition likelihoods, our model improves on state-of-the-art action recognition results on two datasets.
Sameh Khamis, Vlad I. Morariu, Larry S. Davis