F0 is an acoustic feature that varies widely from one speaker to another. Its trajectory is discontinuous at transitions between voiced and unvoiced sounds, which is an obstacle to GMM modeling for voice conversion. A Multi-Space Distribution (MSD) [5] can model the unvoiced and voiced F0 regions as a linearly weighted mixture. However, combining two incompatible probabilistic spaces, for example a continuous probability density for voiced observations and a discrete probability for unvoiced observations, may yield an imprecise voiced/unvoiced (v/u) conversion in the maximum likelihood (ML) sense. In this paper we propose using voicing strength, measured by the magnitude of the normalized correlation coefficient computed during F0 extraction, as an additional feature to improve F0 modeling and the v/u decision in voice conversion. The proposed method was evaluated on male-to-female voice conversion tasks in both Mandarin a...
Aki Kunikoshi, Yao Qian, Frank K. Soong, Nobuaki M
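The voicing strength mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes one common definition: the magnitude of the normalized cross-correlation between a frame and a copy of itself shifted by a candidate pitch lag. The function name `voicing_strength`, the frame length, and the sampling rate are hypothetical choices for illustration, not taken from the paper.

```python
import math
import random


def voicing_strength(frame, lag):
    """Magnitude of the normalized correlation between a frame and its
    lag-shifted copy (an assumed definition, not the paper's own formula).

    Returns a value in [0, 1]: near 1 for strongly periodic (voiced) frames
    when `lag` matches the pitch period, near 0 for noise-like frames.
    """
    if not 0 < lag < len(frame):
        raise ValueError("lag must satisfy 0 < lag < len(frame)")
    a = frame[:-lag]  # samples x[n]
    b = frame[lag:]   # samples x[n + lag]
    num = sum(p * q for p, q in zip(a, b))
    denom = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    # A silent frame has zero energy; treat it as having no voicing.
    return abs(num) / denom if denom > 0.0 else 0.0


if __name__ == "__main__":
    fs = 8000          # hypothetical sampling rate in Hz
    f0 = 100           # hypothetical fundamental frequency in Hz
    lag = fs // f0     # pitch period in samples
    voiced = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(400)]
    rng = random.Random(0)
    unvoiced = [rng.gauss(0.0, 1.0) for _ in range(400)]
    print("voiced:  ", voicing_strength(voiced, lag))
    print("unvoiced:", voicing_strength(unvoiced, lag))
```

Because the value is continuous for every frame, voiced or not, it can be appended to the feature vector without a separate discrete space for unvoiced frames, which is the motivation the abstract gives for using it alongside F0.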