Multimedia documents are increasingly common, which calls for models adapted to this kind of data. In this paper we present a multimedia model that combines textual and visual information. Using a bag-of-words approach, we represent each document with one vector per modality. Given a multimedia query, our model linearly combines the scores obtained for each modality and returns a ranked list of relevant retrieved documents. This article studies the influence of the weight given to the visual information relative to the textual one. Experiments on the multimedia ImageCLEF collection extracted from Wikipedia show that results can be improved by learning this weight parameter.
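As a minimal sketch of the late-fusion scheme described above (the function names and the convention that the weight applies to the visual modality are illustrative assumptions, not taken from the paper), the linear combination of per-modality scores might look like:

```python
def fused_score(text_score, visual_score, alpha):
    # alpha weights the visual modality; (1 - alpha) the textual one.
    # This weighting convention is an assumption for illustration.
    return alpha * visual_score + (1.0 - alpha) * text_score

def rank_documents(text_scores, visual_scores, alpha):
    """Return document indices sorted by descending fused score."""
    fused = [fused_score(t, v, alpha)
             for t, v in zip(text_scores, visual_scores)]
    return sorted(range(len(fused)), key=lambda i: -fused[i])

# Toy example: per-modality retrieval scores for three documents.
text = [0.9, 0.2, 0.5]
visual = [0.1, 0.8, 0.4]
print(rank_documents(text, visual, alpha=0.3))  # → [0, 2, 1]
```

Learning the parameter then amounts to choosing the alpha that maximizes a retrieval measure (e.g. MAP) on held-out queries.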