Virtual conversational agents are expected to combine speech with nonverbal modalities to produce intelligible and believable utterances. However, the automatic synthesis of coverbal gestures still faces several problems, such as the lack of naturalness in procedurally generated animations, the limited flexibility of pre-defined movements, and synchronization with speech. In this paper, we focus on generating complex multimodal utterances, including gesture and speech, from XML-based descriptions of their overt form. We describe a coordination model that reproduces co-articulation and transition effects in both modalities. In particular, we present an efficient kinematic approach to creating gesture animations from shape specifications, which allows fine adaptation to the temporal constraints imposed by cross-modal synchrony.
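As a rough illustration of the kind of cross-modal timing adaptation referred to above, the sketch below back-times a gesture's preparation phase from a stroke onset dictated by speech, so that the stroke coincides with its verbal affiliate. All names, the default phase durations, and the simple back-timing rule are assumptions made for this sketch; they are not the coordination model presented in the paper.

```python
from dataclasses import dataclass

@dataclass
class GesturePhases:
    """Illustrative phase timings of a single gesture, in seconds."""
    onset: float           # preparation begins (hand leaves rest position)
    stroke_start: float    # meaning-carrying stroke begins
    stroke_end: float
    retraction_end: float  # hand is back at rest

def schedule_gesture(stroke_onset: float,
                     prep_duration: float = 0.4,
                     stroke_duration: float = 0.6,
                     retraction_duration: float = 0.5) -> GesturePhases:
    """Back-time the preparation phase from a stroke onset that is imposed
    by speech (e.g., the stressed syllable of the affiliated word).
    The default durations are invented for this illustration."""
    onset = stroke_onset - prep_duration
    stroke_end = stroke_onset + stroke_duration
    return GesturePhases(onset, stroke_onset, stroke_end,
                         stroke_end + retraction_duration)

if __name__ == "__main__":
    # Suppose the affiliated word starts 1.2 s into the synthesized speech.
    print(schedule_gesture(stroke_onset=1.2))
```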