Supervised learning can be used to create good systems for note segmentation in audio data. However, this requires a large set of labeled training examples, and handlabeling is quite difficult and time consuming. A bootstrap approach is introduced in which audio alignment techniques are first used to find the correspondence between a symbolic music representation (such as MIDI data) and an acoustic recording. This alignment provides an initial estimate of note boundaries which can be used to train a segmenter. Once trained, the segmenter can be used to refine the initial set of note boundaries and training can be repeated. This iterative training process eliminates the need for hand-segmented audio. Tests show that this training method can improve a segmenter initially trained on synthetic data.
Ning Hu, Roger B. Dannenberg