This paper describes a new representation for the audio and visual information in a video signal. We reduce the dimensionality of the signals with a singular-value decomposition (SVD) or mel-frequency cepstral coefficients (MFCCs). We apply these transforms to word data (the word transcript, mapped into a semantic space by latent semantic indexing), image data (color histograms), and audio data (timbre). Using scale-space techniques, we find large jumps in a video's path through these low-dimensional spaces, which provide evidence for events. We use these techniques to analyze the temporal properties of the audio and image data in a video. This analysis yields a hierarchical segmentation of the video, or table of contents, from both the audio and image data.
Malcolm Slaney, Dulce B. Ponceleon, James Kaufman
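The following is a minimal sketch, not the authors' implementation, of the scale-space idea described in the abstract: smooth a low-dimensional feature trajectory at several Gaussian scales and measure frame-to-frame jumps, whose largest values at coarse scales suggest segment boundaries. All names, the choice of scales, and the random-walk test data are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def jump_strength(features, sigmas=(2, 4, 8, 16)):
    """features: (n_frames, n_dims) array, e.g. SVD- or MFCC-reduced frames.

    Returns an (n_scales, n_frames) array of frame-to-frame jump magnitudes,
    one row per Gaussian smoothing scale. Large values at coarse scales
    are candidate event boundaries in the video's path.
    """
    strengths = []
    for sigma in sigmas:
        # Smooth each feature dimension along the time axis.
        smooth = gaussian_filter1d(features, sigma=sigma, axis=0)
        # Euclidean distance between consecutive smoothed frames.
        step = np.linalg.norm(np.diff(smooth, axis=0), axis=1)
        strengths.append(np.concatenate([[0.0], step]))
    return np.array(strengths)

# Illustrative usage: 1000 frames of 10-dimensional reduced features,
# here a random walk standing in for a real video trajectory.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(size=(1000, 10)), axis=0)
S = jump_strength(traj)
boundaries = np.argsort(S[-1])[-5:]  # frames with the largest coarse-scale jumps
print(sorted(boundaries.tolist()))
```

In a hierarchical segmentation, coarse scales would supply the top-level chapter boundaries and finer scales the sub-sections within them; how the authors combine scales is not specified in this section.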