In this paper, we propose the interval-based linear hybrid dynamical system (ILHDS), a computational scheme that represents complex dynamic events in terms of temporal intervals, each characterized by linear dynamics and a duration. We then propose a cross-media timing-structure model that represents the dynamic structure among multiple media signals through the relations between temporal intervals described by multiple ILHDSs. To evaluate the proposed scheme, we conducted media-conversion experiments in which lip video is generated from an input audio signal.
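To make the interval-based representation concrete, the following is a minimal illustrative sketch, not the formulation used in the paper. It assumes a scalar state, discrete time steps, and first-order dynamics of the form x[t+1] = a * x[t] + b within each interval. The names `LinearMode`, `Interval`, and `generate` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinearMode:
    """One linear dynamical system (assumed scalar: x[t+1] = a * x[t] + b)."""
    a: float
    b: float


@dataclass
class Interval:
    """A temporal interval: which linear mode is active, and for how long."""
    mode: int      # index into the list of linear modes
    duration: int  # number of time steps the interval lasts


def generate(modes: List[LinearMode], intervals: List[Interval], x0: float) -> List[float]:
    """Generate a signal by running each interval's linear dynamics for its duration.

    The state is carried over continuously from one interval to the next
    (an assumption of this sketch).
    """
    xs: List[float] = []
    x = x0
    for iv in intervals:
        m = modes[iv.mode]
        for _ in range(iv.duration):
            xs.append(x)
            x = m.a * x + m.b
    return xs


# Hypothetical example: a decaying mode followed by a linearly rising mode.
modes = [LinearMode(a=0.5, b=0.0), LinearMode(a=1.0, b=1.0)]
intervals = [Interval(mode=0, duration=3), Interval(mode=1, duration=2)]
signal = generate(modes, intervals, x0=8.0)
```

In this sketch the event is fully described by the list of (mode, duration) pairs; a timing-structure model in the sense above would then relate the interval boundaries of two such sequences, one per media signal.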