In this paper, we present a joint multimodal (audio, visual and text) framework that maps the informational complexity of media elements to comprehension time. The problem is important for interactive multimodal presentations. We propose that the joint comprehension time is a function of the Kolmogorov complexity of the media. For audio and images, the complexity is estimated using a lossless universal coding scheme. The text complexity is derived by analyzing sentence structure. For all three channels, we conduct user studies to map media complexity to comprehension time. To estimate the joint comprehension time, we assume channel independence, which yields a conservative estimate. The comprehension times for the visual channels (text and images) are deemed additive, and the joint time is then the maximum of the visual and the auditory comprehension times. The user studies indicate that the model performs well when compared with fixed-time multimodal presentations.
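The combination rule described above can be illustrated with a minimal sketch. It assumes that the per-channel comprehension times have already been obtained from the complexity-to-time mappings, which are not reproduced here. The function names are hypothetical, and `zlib` is only a stand-in for the lossless universal coder, whose exact form the abstract does not specify.

```python
import zlib


def complexity_estimate(data: bytes) -> int:
    """Rough upper bound on Kolmogorov complexity.

    Assumption: zlib (DEFLATE) stands in for the lossless universal
    coding scheme; the result is the compressed length in bytes.
    """
    return len(zlib.compress(data, 9))


def joint_comprehension_time(t_text: float, t_image: float, t_audio: float) -> float:
    """Combine per-channel comprehension times (same unit, e.g. seconds).

    Text and image share the visual channel, so their times add.
    The visual and auditory channels are assumed independent, so the
    joint time is the larger of the two.
    """
    t_visual = t_text + t_image
    return max(t_visual, t_audio)
```

For example, with hypothetical times of 2.0 s for text, 3.0 s for an image and 4.0 s for audio, the visual channel needs 5.0 s and therefore determines the joint time.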