Widespread current cameras are part of multisensory systems with an integrated computer (smartphones). Computer vision thus starts evolving to cross-modal sensing, where vision an...
Most existing web video search engines index videos by file names, URLs, and surrounding texts. These types of video roughly describe the whole video in an abstract level without ...
Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. Motivated by the known limitations of traditional supervised approache...
Word prediction performed by language models has an important role in many tasks as e.g. word sense disambiguation, speech recognition, hand-writing recognition, query spelling an...
The paper describes the IBM systems submitted to the NIST Rich Transcription 2007 (RT07) evaluation campaign for the speechto-text (STT) and speaker-attributed speech-to-text (SAST...