Multimedia interpretation, and in particular the combined interpretation of information from different modalities, requires a semantically well-founded formalization in the context of an agent-based scenario. Low-level percepts, represented symbolically, constitute the observations of an agent, and interpretations of content are defined as explanations of these observations. We propose an abduction-based formalism that uses description logics for the ontology and Horn rules to define the space of hypotheses for explanations (i.e., the space of possible interpretations of media content). Markov logic is used both to define the agent's motivation for generating explanations and to rank competing explanations.

This work has been funded by the European Community with the project CASAM (Contract FP7-217061 CASAM) and by the German Science Foundation with the project PRESINT (DFG MO 801/1-1).