Realistic audio-visual mapping remains a challenging problem, and keeping the delay between audio input and visual output short is also of great importance. In this paper, we present a new dynamic audio-visual mapping approach based on Fused Hidden Markov Model (Fused HMM) inversion. The Fused HMM explicitly models the loose synchronization between the two tightly coupled streams, audio speech and visual speech. Given novel audio input, the derived inversion algorithm synthesizes the visual counterpart by maximizing the joint probability distribution of the Fused HMM. When the algorithm is applied to subsets built from the training corpus, it yields realistic synthesized facial animation with a relatively short delay. Experiments on a bimodal 3D motion-capture database show that the synthesized results are comparable to the ground truth.
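As a rough illustration of the inversion step only, and not the paper's actual derivation, the sketch below assumes a greatly simplified Fused HMM in which both streams share a single audio-driven state chain with diagonal-Gaussian emissions: it Viterbi-decodes the most likely state path from the audio stream and then, per state, emits the visual mean, which maximizes the coupled Gaussian visual density. All names here (`fused_hmm_inversion`, the `model` dictionary keys) are hypothetical.

```python
import numpy as np

def viterbi(log_pi, log_A, log_b):
    """Most likely hidden-state path given per-frame emission log-likelihoods."""
    T, N = log_b.shape
    delta = log_pi + log_b[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (from-state, to-state)
        psi[t] = scores.argmax(axis=0)           # best predecessor per state
        delta = scores.max(axis=0) + log_b[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):               # backtrack
        path[t] = psi[t + 1, path[t + 1]]
    return path

def gaussian_loglik(X, means, covs):
    """log N(x_t | mu_i, diag(Sigma_i)) for every frame t and state i."""
    T, D = X.shape
    out = np.empty((T, means.shape[0]))
    for i in range(means.shape[0]):
        diff = X - means[i]
        out[:, i] = -0.5 * (np.sum(diff**2 / covs[i], axis=1)
                            + np.sum(np.log(covs[i])) + D * np.log(2 * np.pi))
    return out

def fused_hmm_inversion(audio, model):
    """Simplified inversion sketch: decode the audio state sequence, then
    output the visual observation maximizing the coupled density per state
    (for a Gaussian coupling density, that maximizer is the state mean)."""
    log_b = gaussian_loglik(audio, model["audio_means"], model["audio_covs"])
    states = viterbi(np.log(model["pi"]), np.log(model["A"]), log_b)
    return model["visual_means"][states]       # one visual frame per audio frame
```

Because each output frame depends only on the decoded state path up to that frame, a windowed variant of this decoding is what keeps the input-to-output delay short; the full method additionally maximizes the joint audio-visual probability rather than conditioning on the audio path alone.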