In this paper, an HMM-embedded unsupervised learning approach is proposed to detect music events by grouping similar segments of a music signal. The approach clusters segments according to the similarity of both their spectral and temporal structures, which is difficult to achieve with traditional similarity measures. Combined with the Bayesian information criterion, it selects a suitable event set that regularizes the complexity of the model structure. A natural product of the approach is a set of music events, each modeled by an HMM. Our experimental analyses show that the detected music events are more perceptually meaningful and more consistent than those obtained by KL-distance-based clustering, and the learned events agree better with our experience in spectrogram reading. The approach is further evaluated on a music identification task. The identification error rate is
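The model-selection step mentioned above can be illustrated with a minimal sketch of BIC-based selection of the event-set size. This is not the paper's implementation: the candidate log-likelihoods, parameter counts, and the helper names `bic` and `select_event_count` are illustrative assumptions; in practice the log-likelihoods would come from HMMs trained with each candidate number of events.

```python
import numpy as np

def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: -2 log L + k log N (lower is better)."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

def select_event_count(log_likelihoods, param_counts, n_samples):
    """Return the index of the candidate event-set size with minimal BIC.

    log_likelihoods: training log-likelihood for each candidate model
    param_counts:    number of free parameters for each candidate model
    n_samples:       number of observations used for training
    """
    scores = [bic(ll, k, n_samples)
              for ll, k in zip(log_likelihoods, param_counts)]
    return int(np.argmin(scores)), scores

# Toy example: likelihood improves with diminishing returns as the
# event set grows, while the parameter count (and BIC penalty) doubles.
lls = [-5000.0, -4200.0, -4100.0, -4080.0]   # hypothetical fits
ks = [50, 100, 200, 400]                      # hypothetical parameter counts
best, scores = select_event_count(lls, ks, n_samples=1000)
print(best)  # index of the BIC-optimal candidate
```

The penalty term `k log N` is what keeps the learned event set from growing without bound: past the second candidate, the small likelihood gains no longer offset the doubled parameter count.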