We are developing a testbed for learning by demonstration combining spoken language and sensor data in a natural real-world environment. Microsoft Kinect RGB-Depth cameras allow us...
Scalable approaches to video content classification are limited by an inability to automatically generate representations of events that encode abstract temporal structure. This paper pre...
This paper focuses on Audio Event Detection (AED), a research area which aims to substantially enhance the access to audio in multimedia content. With the ever-growing quantity ...
Virginia Barbosa, Thomas Pellegrini, Miguel Bugalh...
We introduce the first visual dataset of fast foods with a total of 4,545 still images, 606 stereo pairs, 303 360° videos for structure from motion, and 27 privacy-preserving vide...
The complexity of visual representations is substantially limited by the compositional nature of our visual world which, therefore, renders learning structured object mod...