Abstract. Multimodal human-to-human interaction requires integrating the contents and meaning of the modalities involved. Artificial Intelligence (AI) multimodal prototypes attempt to go beyond the technical integration of modalities to this kind of meaning integration, which allows for coherent, natural, “intelligent” communication with humans. Though such prototypes bring together many multimedia-related AI research fields, integration, and vision-language integration in particular, is an issue that still remains in the background. In this paper, we attempt to fill this lacuna by shedding light on how, why, and to what extent vision-language content integration takes place within AI. We present a taxonomy of vision-language integration prototypes that resulted from an extensive survey of such prototypes across a wide range of AI research areas; the taxonomy uses a prototype’s integration purpose as the guiding criterion for classification. We look at the integration resources and mechanis...