Current speech recognition systems are often based on HMMs with state-clustered Gaussian Mixture Models (GMMs) to represent the context-dependent output distributions. Though highly successful, the standard form of model does not exploit any relationships between the states; each state has its own separate model parameters. This paper describes a general class of model in which the context-dependent state parameters are transformed versions of one or more canonical states. A number of published models sit within this framework, including semi-continuous HMMs, subspace GMMs, and the HMM error model. A set of preliminary experiments illustrating some of the properties of this class of model, using CMLLR transformations from the canonical state to the context-dependent state, is described.
Mark J. F. Gales, Kai Yu
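To make the canonical-state idea concrete, below is a minimal NumPy sketch (not taken from the paper) of how a CMLLR transform {A, b} could map a single shared canonical Gaussian to context-dependent state distributions. It assumes the standard model-space form of CMLLR (mean A^{-1}(mu - b), covariance A^{-1} Sigma A^{-T}); the exact parameterisation and estimation used in the paper may differ, and the example data are hypothetical.

```python
import numpy as np

def cmllr_transform_gaussian(mu, Sigma, A, b):
    """Map a canonical Gaussian N(mu, Sigma) to a context-dependent one.

    Feature-space CMLLR applies o -> A o + b; the equivalent model-space
    transform of the canonical parameters is
        mu'    = A^{-1} (mu - b)
        Sigma' = A^{-1} Sigma A^{-T}
    This is an illustrative sketch using the usual CMLLR convention, not
    necessarily the parameterisation adopted in the paper.
    """
    A_inv = np.linalg.inv(A)
    mu_cd = A_inv @ (mu - b)
    Sigma_cd = A_inv @ Sigma @ A_inv.T
    return mu_cd, Sigma_cd

# Hypothetical example: one canonical Gaussian shared by two
# context-dependent states, each with its own CMLLR transform, so the
# states are linked through shared canonical parameters rather than
# being estimated independently.
d = 3
mu = np.zeros(d)
Sigma = np.eye(d)
rng = np.random.default_rng(0)
for state in range(2):
    A = np.eye(d) + 0.1 * rng.standard_normal((d, d))
    b = 0.1 * rng.standard_normal(d)
    mu_s, Sigma_s = cmllr_transform_gaussian(mu, Sigma, A, b)
    print(f"state {state}: mean = {np.round(mu_s, 3)}")
```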