Multimedia event detection (MED) and multimedia event recounting (MER) are fundamental tasks in managing large amounts of unconstrained web videos, and have attracted a lot of attention in recent years. Most existing systems perform MER as a postprocessing step on top of the MED results. In order to leverage the mutual benefits of the two tasks, we propose a joint framework that simultaneously detects high-level events and localizes the indicative concepts of the events. Our premise is that a good recounting algorithm should not only explain the detection result, but should also be able to assist detection in the first place. Coupled in a joint optimization framework, recounting improves detection by pruning irrelevant noisy concepts while detection directs recounting to the most discriminative evidences. To better utilize the powerful and interpretable semantic video representation, we segment each video into several shots and exploit the rich temporal structures at shot level. The...
Xiaojun Chang, Yaoliang Yu, Yi Yang, Alexander G.