In this paper, we present a methodology for estimating the detailed state of a video scene involving multiple humans and vehicles. To automatically annotate and retrieve videos containing human and vehicle activities, a system must correctly identify their trajectories and relationships even in a complex dynamic environment. Our methodology constructs various joint 3-D models describing possible configurations of humans and vehicles in each image frame and performs maximum-a-posteriori tracking to obtain the sequence of scene states that best matches the video. Reliable and view-independent scene-state analysis is achieved by taking advantage of event context: we exploit the fact that events occurring in a video must contextually coincide with the scene states of the humans and vehicles. Our experimental results verify that our system using event context analyzes and tracks the 3-D scene states of complex human-vehicle interactions more reliably and accurately than previous tracking systems.
M. S. Ryoo, Jong Taek Lee, Jake K. Aggarwal
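As a sketch of the maximum-a-posteriori tracking the abstract refers to, one standard way to write such an estimation problem is shown below; the notation (S_t for the joint 3-D scene state of humans and vehicles at frame t, I_t for the image observation, E_t for the event-context evidence) is assumed for illustration and is not taken verbatim from the paper.

\[
S_{1:T}^{*} \;=\; \arg\max_{S_{1:T}} P(S_{1:T} \mid I_{1:T}, E_{1:T}),
\qquad
P(S_{1:T} \mid I_{1:T}, E_{1:T}) \;\propto\; \prod_{t=1}^{T} P(I_t \mid S_t)\, P(E_t \mid S_t)\, P(S_t \mid S_{t-1}).
\]

Under this reading, P(I_t | S_t) scores how well a hypothesized joint 3-D configuration explains the observed frame, P(E_t | S_t) is the event-context term requiring detected events to coincide with the hypothesized scene state, and P(S_t | S_{t-1}) enforces temporal continuity of the trajectories.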