An attractive approach to improving tracking performance in visual surveillance is to use information from multiple visual sensory cues such as position, color, and shape. Previous work on fusion for tracking has tended to focus on numerically combining the scores assigned by each cue. We argue that in crowded scenes with many targets, the splitting and merging of regions associated with targets, and the resulting dramatic changes in cue values and reliabilities, render this form of fusion less effective. In this paper we present experimental results showing that using cue-rank information in fusion produces significantly better tracking in crowded scenes. We also present a formalization of this fusion problem as a step toward understanding why this effect occurs and how to build a tracking system that exploits it.
Damian M. Lyons, D. Frank Hsu
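The contrast the abstract draws, combining cues by score versus by rank, can be illustrated with a minimal sketch. This is not the paper's algorithm; the cue names, scores, and averaging rules below are illustrative assumptions showing how a single unreliable cue score can swing a score-based combination while a rank-based combination, which sees only each cue's ordering of the candidates, is less affected.

```python
# Minimal sketch (illustrative, not the paper's method): fusing cue
# evidence for candidate target regions by score vs. by rank.

def score_fusion(score_lists):
    """Average the (normalized) scores each cue assigns to every candidate."""
    n = len(score_lists[0])
    return [sum(s[i] for s in score_lists) / len(score_lists) for i in range(n)]

def to_ranks(scores):
    """Rank candidates by score: the best-scoring candidate gets rank 1."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def rank_fusion(score_lists):
    """Average the rank each cue assigns to every candidate (lower is better)."""
    rank_lists = [to_ranks(s) for s in score_lists]
    n = len(rank_lists[0])
    return [sum(r[i] for r in rank_lists) / len(rank_lists) for i in range(n)]

# Hypothetical normalized scores for 2 candidate regions from 3 cues.
# Region 0 narrowly wins on color and shape; the position cue, thrown off
# by a region merge, gives region 1 an extreme score.
color    = [0.51, 0.49]
shape    = [0.52, 0.48]
position = [0.10, 0.99]

by_score = score_fusion([color, shape, position])  # region 1 wins: the one
                                                   # extreme score dominates
by_rank  = rank_fusion([color, shape, position])   # region 0 wins: ranked
                                                   # first by two of three cues
print(by_score, by_rank)
```

The point of the toy numbers is that the two combination rules disagree: the outlier position score drags the score average toward region 1, while the rank average still favors the region that most cues order first.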