In this paper, we introduce a system that synthesizes emotional audio-visual speech for a 3-D talking agent by adopting the PAD (Pleasure-Arousal-Dominance) emotional model. A GMM-based method is introduced to predict the variation of acoustic features for emotional speech from PAD values, and a parametric framework for PAD-driven emotional facial expression synthesis is built. As the focus of this paper, we performed a series of perceptual evaluations to understand the reinforcing effect of vocal and facial expressions of emotion, and investigated the usefulness and effectiveness of the emotional talking agent in human-computer speech communication. Three questions are addressed: 1) To what extent do different interfaces affect humans' comprehension of emotion? 2) How accurately is emotional information conveyed by the talking agent? 3) Is the multimodal (audio-visual) interface helpful to humans' emotion comprehension? An evaluation involving 19 participants was conducted ...
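Since the exact formulation of the GMM-based mapping from PAD values to acoustic-feature variation is not given here, the following is a minimal sketch assuming a standard joint-density GMM regression: a single GMM is fit over concatenated [PAD, acoustic-variation] vectors, and the acoustic variation for a PAD query is predicted as the conditional mean under that joint model. All names, dimensionalities, the stand-in data, and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

PAD_DIM = 3  # pleasure, arousal, dominance

def fit_joint_gmm(pad, acoustic, n_components=8, seed=0):
    """Fit one GMM over concatenated [PAD, acoustic-variation] vectors."""
    joint = np.hstack([pad, acoustic])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(joint)
    return gmm

def predict_acoustic_variation(gmm, pad_query):
    """GMM regression: E[acoustic variation | PAD] under the joint model."""
    x = np.asarray(pad_query, dtype=float)
    means_x = gmm.means_[:, :PAD_DIM]                # component means over PAD
    means_y = gmm.means_[:, PAD_DIM:]                # component means over acoustics
    cov_xx = gmm.covariances_[:, :PAD_DIM, :PAD_DIM]
    cov_yx = gmm.covariances_[:, PAD_DIM:, :PAD_DIM]

    # Posterior responsibility of each component given the PAD query;
    # the shared (2*pi)^{d/2} constant cancels in the normalization.
    log_resp = np.empty(gmm.n_components)
    for k in range(gmm.n_components):
        diff = x - means_x[k]
        inv = np.linalg.inv(cov_xx[k])
        _, logdet = np.linalg.slogdet(cov_xx[k])
        log_resp[k] = (np.log(gmm.weights_[k])
                       - 0.5 * diff @ inv @ diff - 0.5 * logdet)
    resp = np.exp(log_resp - log_resp.max())
    resp /= resp.sum()

    # Responsibility-weighted mixture of per-component conditional means.
    y = np.zeros(means_y.shape[1])
    for k in range(gmm.n_components):
        y += resp[k] * (means_y[k]
                        + cov_yx[k] @ np.linalg.inv(cov_xx[k]) @ (x - means_x[k]))
    return y

# Hypothetical usage with stand-in data: PAD annotations in [-1, 1] and
# 5-dimensional acoustic deltas (e.g., F0, duration, and energy offsets).
rng = np.random.default_rng(0)
pad = rng.uniform(-1.0, 1.0, size=(500, PAD_DIM))
acoustic = rng.normal(size=(500, 5))
model = fit_joint_gmm(pad, acoustic)
print(predict_acoustic_variation(model, [0.8, 0.6, 0.4]))  # a "happy"-like PAD point
```

Conditioning a joint GMM in this way is a common choice for continuous emotion-to-acoustics mappings because it yields a smooth interpolation across the PAD space rather than a fixed set of categorical emotion targets.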