We introduce a new model for extracting classified structural segments, such as intro, verse, chorus, break and so forth, from recorded music. Our approach is to classify signal frames on the basis of their audio properties and then to agglomerate contiguous runs of similarly classified frames into texturally homogeneous (or ‘self-similar’) segments, which inherit the classification of their constituent frames. This extends previous work on automatic structure extraction by addressing the classification problem with an unsupervised Bayesian clustering model, the parameters of which are estimated using a variant of the expectation maximisation (EM) algorithm that includes deterministic annealing to help avoid local optima. The model identifies and classifies all the segments in a song, not just the chorus or longest segment. We discuss the theory, implementation, and evaluation of the model, and test its performance against a ground truth of human judgements. Using an...
Samer A. Abdallah, Katy Noland, Mark B. Sandler, M
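
The agglomeration step described in the abstract, in which contiguous runs of identically classified frames are merged into segments that inherit the frame label, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the per-frame labels have already been produced by some classifier (the Bayesian clustering and annealed EM stages are not shown), and the function name `agglomerate` and the example labels are hypothetical.

```python
from itertools import groupby


def agglomerate(frame_labels):
    """Merge contiguous runs of equal frame labels into segments.

    Returns a list of (label, start_frame, end_frame) tuples, where
    end_frame is exclusive. The input labels are assumed to come from
    an upstream frame classifier (hypothetical here).
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of frames in this run
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical per-frame classifications for a short excerpt.
labels = ["intro", "intro", "verse", "verse", "verse", "chorus", "verse"]
segments = agglomerate(labels)
```

Each segment carries the label of its constituent frames, so a label that recurs later in the song (here `"verse"`) yields a separate segment with the same classification.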