This paper proposes a method for separating the signals of individual musical instruments from monaural musical audio. The mixture signal is modeled as a sum of the spectra of individual musical sounds, each of which is further represented as the product of an excitation and a filter. The excitations are restricted to harmonic spectra, and their fundamental frequencies are estimated in advance using a multipitch estimator, whereas the filters are restricted to have smooth frequency responses by modeling them as a sum of elementary functions on the Mel-frequency scale. A novel expectation-maximization (EM) algorithm is proposed that jointly learns the filter responses and assigns the excitations (musical notes) to filters (instruments). In simulations, the method achieved an SNR improvement of over 5 dB relative to the mixture signals when separating two or three musical instruments from each other. A slight further improvement was achieved by utilizing musical properties in the initialization of the a...
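The excitation-filter model summarized above can be illustrated with a minimal sketch. It covers only the forward model (mixture = sum over notes of excitation times filter), not the EM algorithm or the multipitch estimator. Everything concrete here is an assumption made for illustration: the function names, the triangular shape of the elementary functions, the flat harmonic comb used as excitation, the common Mel formula 2595 log10(1 + f/700), and the example weights and fundamental frequencies.

```python
import math

def hz_to_mel(f_hz):
    # One common Mel-scale formula (assumed; the paper may use another variant).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_triangles(freqs_hz, n_bands, f_max_hz):
    """Triangular elementary functions with centers uniform on the Mel scale.
    The triangular shape is an illustrative assumption."""
    top = hz_to_mel(f_max_hz)
    edges = [mel_to_hz(top * i / (n_bands + 1)) for i in range(n_bands + 2)]
    basis = []
    for j in range(1, n_bands + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for f in freqs_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        basis.append(row)
    return basis

def harmonic_excitation(freqs_hz, f0_hz, bin_width_hz):
    """Flat harmonic comb: 1.0 in the bin nearest each multiple of f0, else 0.0."""
    exc = [0.0] * len(freqs_hz)
    h = 1
    while h * f0_hz <= freqs_hz[-1]:
        exc[round(h * f0_hz / bin_width_hz)] = 1.0
        h += 1
    return exc

def filter_response(weights, basis):
    """Smooth filter: weighted sum of the elementary functions."""
    n_bins = len(basis[0])
    return [sum(w * row[k] for w, row in zip(weights, basis)) for k in range(n_bins)]

def note_spectrum(excitation, response, gain=1.0):
    """Spectrum of one note: excitation times filter, bin by bin."""
    return [gain * e * r for e, r in zip(excitation, response)]

def mixture(spectra):
    """Mixture spectrum: sum of the individual note spectra."""
    return [sum(vals) for vals in zip(*spectra)]

# Hypothetical setup: 10 Hz bins up to 4 kHz, 8 Mel bands, two instruments.
bin_width = 10.0
n_bins = 401
freqs = [k * bin_width for k in range(n_bins)]
basis = mel_triangles(freqs, 8, 4000.0)

resp_a = filter_response([1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05], basis)
resp_b = filter_response([0.2, 0.4, 0.9, 1.0, 0.7, 0.3, 0.2, 0.1], basis)

# Fundamental frequencies are taken as given, standing in for the estimator output.
note_a = note_spectrum(harmonic_excitation(freqs, 220.0, bin_width), resp_a)
note_b = note_spectrum(harmonic_excitation(freqs, 330.0, bin_width), resp_b)
mix = mixture([note_a, note_b])
```

In this sketch the mixture is nonzero only at bins that are harmonics of at least one note, and where harmonics of both notes coincide (for example 660 Hz) the two filter responses add. Separation amounts to the inverse problem: recovering the filter weights and the note-to-instrument assignment from `mix`, which is the role of the EM algorithm in the paper.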