In this paper, we propose a novel method for generating engaging multi-modal content automatically from text. Rhetorical Structure Theory (RST) is used to decompose text into discourse units and to identify rhetorical discourse relations between them. Rhetorical relations are then mapped to question–answer pairs in an information-preserving way, i.e., the original text and the resulting dialogue convey essentially the same meaning. Finally, the dialogue is “acted out” by two virtual agents. The network of dialogue structures automatically built up during this process, called DialogueNet, can be reused for other purposes, such as personalization or question answering.
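
To make the mapping step concrete, the following minimal Python sketch shows one way a single rhetorical relation between two discourse units could be turned into a short exchange between two agents. It is an illustration under stated assumptions, not the system described here: the `Relation` class, the `QUESTION_TEMPLATES` table, the speaker labels, and the function `to_dialogue` are all hypothetical names, and the question wordings are invented for the example. Discourse segmentation and relation identification are assumed to have been done already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One RST relation between two discourse units (hypothetical structure)."""
    name: str       # relation label, e.g. "cause"
    nucleus: str    # the more central discourse unit
    satellite: str  # the supporting discourse unit


# Hypothetical question templates, one per relation label.
# These wordings are assumptions made for illustration only.
QUESTION_TEMPLATES = {
    "cause": "Why?",
    "condition": "Under what condition?",
    "elaboration": "Can you tell me more?",
}

# Fallback question for relation labels without a template.
DEFAULT_QUESTION = "Can you tell me more?"


def to_dialogue(relation: Relation) -> list[tuple[str, str]]:
    """Map a relation to (speaker, utterance) turns.

    Both discourse units are kept verbatim, so the exchange carries the
    same content as the original text; only the question is added.
    """
    question = QUESTION_TEMPLATES.get(relation.name, DEFAULT_QUESTION)
    return [
        ("expert", relation.nucleus),
        ("layman", question),
        ("expert", relation.satellite),
    ]


rel = Relation("cause", "The match was cancelled.", "It had rained all night.")
turns = to_dialogue(rel)
for speaker, utterance in turns:
    print(f"{speaker}: {utterance}")
```

In this sketch, the sense in which the mapping preserves information is simply that both discourse units reappear unchanged as answers; a full system would also need to handle nested relations and rephrase units so that they read naturally as dialogue turns.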