The scarcity of available multi-track recordings constitutes a severe constraint on the training of probabilistic models for voice extraction from polyphonic music. We propose a novel training method to estimate a spectral envelope of a singing voice that makes it possible to train the models from a polyphonic music without segregating a singing voice. We implement this method as an extension to the existing W-PST method, which concurrently estimates singing voice fundamental frequency (F0) and phoneme from polyphonic music. The novel training method is based on random sampling from probabilistic distributions. We conducted experiments on concurrent F0 and phoneme estimation and confirm the effectiveness of our method.