In this paper, we present a novel framework for machine-learning-based cross-media knowledge extraction. The framework is specifically designed to handle documents composed of three types of media – text, images and raw data – and to exploit evidence for an extracted fact from across these media. We validate the framework by applying it to the design and development of cross-media extraction systems in two real-world use cases from the car manufacturing industry. Moreover, we show that in these use cases the cross-media approach improves extraction accuracy.