Keansub Lee, Daniel P. W. Ellis

Abstract--This paper presents a novel method for automatically classifying consumer video clips based on their soundtracks. We use a set of 25 overlapping semantic classes, chosen for their usefulness to users, viability of automatic detection and of annotator labeling, and sufficiency of representation in available video collections. A set of 1,873 videos from real users has been annotated with these concepts. Starting with a basic representation of each video clip as a sequence of MFCC frames, we experiment with three clip-level representations: Single Gaussian Modeling, Gaussian Mixture Modeling, and Probabilistic Latent Semantic Analysis of a Gaussian Component Histogram. Using such summary features, we produce SVM classifiers based on the Kullback-Leibler, Bhattacharyya, or Mahalanobis distance measures. Quantitative evaluation shows that our approaches are effective for detecting interesting concepts in a large collection of real-world consumer video clips.
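As an illustrative sketch only (not the paper's released code), the single-Gaussian clip representation and a symmetrized Kullback-Leibler SVM kernel named in the abstract might be realized roughly as below; the helper names (`clip_gaussian`, `symmetric_kl`), the covariance regularizer, the toy data, and the kernel-width heuristic are all assumptions made for demonstration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def clip_gaussian(mfcc, reg=1e-6):
    """Single-Gaussian clip summary: mean and (regularized) covariance of
    the clip's MFCC frames (frames x dims).  The regularizer is an
    assumption for numerical stability, not taken from the paper."""
    mu = mfcc.mean(axis=0)
    cov = np.cov(mfcc, rowvar=False) + reg * np.eye(mfcc.shape[1])
    return mu, cov

def kl(mu1, cov1, mu2, cov2):
    """Closed-form KL(N1 || N2) between two multivariate Gaussians."""
    d = mu1.size
    inv2 = np.linalg.inv(cov2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (np.trace(inv2 @ cov1) + diff @ inv2 @ diff - d
                  + logdet2 - logdet1)

def symmetric_kl(g1, g2):
    """Symmetrized KL divergence between two clip-level Gaussians."""
    return kl(*g1, *g2) + kl(*g2, *g1)

# Toy stand-in for per-clip MFCC sequences: 40 clips, 13-dim frames,
# with hypothetical binary concept labels.
clips = [rng.normal(loc=c % 2, size=(rng.integers(80, 200), 13))
         for c in range(40)]
labels = np.array([c % 2 for c in range(40)])

gaussians = [clip_gaussian(m) for m in clips]
D = np.array([[symmetric_kl(gi, gj) for gj in gaussians]
              for gi in gaussians])
gamma = 1.0 / np.median(D[D > 0])  # heuristic kernel width (assumption)
K = np.exp(-gamma * D)             # distance turned into an SVM kernel

clf = SVC(kernel="precomputed").fit(K, labels)
print("training accuracy:", clf.score(K, labels))
```

Exponentiating a negative divergence is a common way to derive an SVM kernel from a distance measure; such kernels are not guaranteed positive semidefinite, so practical use may require checking or regularizing the Gram matrix.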