Ming-yu Chen, Lily B. Mummert, Padmanabhan Pillai,

ABSTRACT
Vision-based user interfaces enable natural interaction modalities such as gestures, but require computationally intensive video processing at low latency. We demonstrate an application that recognizes gestures to control TV operations. Accurate recognition is achieved by using a new descriptor called MoSIFT, which explicitly encodes optical flow together with appearance features. MoSIFT is computationally expensive; a sequential implementation runs 100 times slower than real time. To reduce latency sufficiently for interaction, the application is implemented on a runtime system that exploits the parallelism inherent in video understanding applications.

Categories and Subject Descriptors
C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems; D.2 [Software Engineering].

General Terms
Algorithms, Performance, Design.

Keywords
Parallel Computing, Cluster Applications, Multimedia, Sensing, Stream Processing, Computational Perception.
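The abstract's key idea is that MoSIFT pairs an appearance descriptor with an explicit motion descriptor at each interest point. The following is a minimal, simplified sketch of that pairing, not the paper's actual 256-dimensional MoSIFT implementation: it concatenates a gradient-orientation histogram (appearance) with an optical-flow direction histogram (motion). The function names, the 8-bin layout, and the assumption that flow vectors are precomputed are all illustrative choices, not details from the paper.

```python
import math

def grad_hist(patch, bins=8):
    """Histogram of image-gradient orientations (appearance part).
    `patch` is a 2D grid of grayscale intensities."""
    h = [0.0] * bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(dx, dy)
            ang = math.atan2(dy, dx) % (2 * math.pi)
            h[int(ang / (2 * math.pi) * bins) % bins] += mag
    return h

def flow_hist(flow, bins=8):
    """Histogram of optical-flow directions (motion part).
    `flow` is a 2D grid of (u, v) motion vectors, assumed precomputed
    by any optical-flow method."""
    h = [0.0] * bins
    for row in flow:
        for u, v in row:
            mag = math.hypot(u, v)
            if mag > 0:
                ang = math.atan2(v, u) % (2 * math.pi)
                h[int(ang / (2 * math.pi) * bins) % bins] += mag
    return h

def mosift_like(patch, flow):
    """Concatenate appearance and motion histograms into one descriptor,
    illustrating MoSIFT's joint encoding at a single interest point."""
    return grad_hist(patch) + flow_hist(flow)
```

A horizontal intensity ramp with uniform rightward flow, for example, yields a 16-dimensional descriptor whose appearance and motion energy both concentrate in the first bin of their respective halves. The real MoSIFT additionally uses flow magnitude to filter interest points, so that descriptors are computed only where there is sufficient motion.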