Though everyday interaction is predominantly multimodal, a purpose-developed framework for describing the semantic interplay between verbal and non-verbal communication is still lacking. This lack not only reflects our poor understanding of multimodal human behaviour, but also weakens any attempt to model such behaviour computationally. In this article, we present COSMOROE, a corpus-based framework for describing semantic interrelations between images, language and body movements. We argue that by viewing such relations from a message-formation perspective rather than a communicative-goal one, one may develop a framework with descriptive power and computational applicability. We test COSMOROE for compliance with these criteria by using it to annotate a corpus of TV travel programmes; we present all particulars of the annotation process and conclude with a discussion of the usability and scope of such annotated corpora.

Keywords Cross-media relations