In this paper we describe an approach that uses a combination of visual and audio features to group shots of the same person in video programs. We use color histograms extracted from keyframes and from faces detected in shots, as well as cepstral coefficients derived from the audio, to calculate pairwise shot distances. These distances are then normalized and combined into a single confidence value that reflects our certainty that the two shots contain the same person. We then use an agglomerative clustering algorithm to cluster shots based on these confidence values. We report the results of our system on a data set of approximately 8 hours of programming.
Alberto Albiol, Cüneyt M. Taskiran, Edward J.
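
As an illustration only, the pipeline summarized in the abstract (per-modality pairwise distances, normalization, fusion into one confidence value, agglomerative clustering) can be sketched in Python as follows. The abstract does not specify the normalization, the fusion rule, the linkage criterion, or the stopping rule, so min-max scaling, a weighted average, average linkage, and a fixed confidence threshold are all assumptions made here. The function names and the toy distance values are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def min_max_normalize(dist):
    """Rescale pairwise distances to [0, 1] (assumed normalization)."""
    lo, hi = min(dist.values()), max(dist.values())
    span = hi - lo
    return {pair: (d - lo) / span if span else 0.0 for pair, d in dist.items()}


def combine_confidence(modalities, weights):
    """Fuse modalities into one confidence per shot pair.

    Assumed rule: weighted average of (1 - normalized distance),
    so 1.0 means "very likely the same person".
    """
    normed = [min_max_normalize(m) for m in modalities]
    total = sum(weights)
    return {
        pair: sum(w * (1.0 - n[pair]) for w, n in zip(weights, normed)) / total
        for pair in modalities[0]
    }


def agglomerative_cluster(n_shots, confidence, threshold):
    """Average-linkage agglomerative clustering (assumed linkage).

    Repeatedly merges the two most confident clusters and stops when
    the best remaining link falls below `threshold` (assumed stop rule).
    """
    clusters = [{i} for i in range(n_shots)]

    def link(a, b):
        vals = [confidence[(min(i, j), max(i, j))] for i in a for j in b]
        return sum(vals) / len(vals)

    while len(clusters) > 1:
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda xy: link(clusters[xy[0]], clusters[xy[1]]),
        )
        if link(clusters[x], clusters[y]) < threshold:
            break
        clusters[x] |= clusters[y]
        del clusters[y]  # y > x, so index x stays valid
    return clusters


# Toy distances for 4 shots, keyed by (i, j) with i < j (hypothetical values).
color = {(0, 1): 0.1, (0, 2): 0.9, (0, 3): 0.8, (1, 2): 0.85, (1, 3): 0.9, (2, 3): 0.2}
face = {(0, 1): 0.2, (0, 2): 0.7, (0, 3): 0.75, (1, 2): 0.8, (1, 3): 0.7, (2, 3): 0.1}
audio = {(0, 1): 2.0, (0, 2): 10.0, (0, 3): 9.0, (1, 2): 9.5, (1, 3): 10.0, (2, 3): 3.0}

confidence = combine_confidence([color, face, audio], weights=[1.0, 1.0, 1.0])
clusters = agglomerative_cluster(4, confidence, threshold=0.5)
print(sorted(sorted(c) for c in clusters))
```

Normalizing each modality before fusion keeps a feature with a large numeric range (the audio distances in the toy data) from dominating the combined confidence. The equal weights and the 0.5 threshold are placeholders that would need tuning on real data.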