Multi-modality is a fundamental feature of biological systems: it lets them perceive and understand their environment robustly while coping with uncertainty. Recent studies have shown that multi-modal learning is a potentially effective addition to artificial systems, since it allows information to be transferred from one modality to another. In this paper we propose a general architecture for jointly learning visual and motor patterns: using regression theory, we model a mapping between the two sensory modalities, with the aim of improving the performance of artificial perceptual systems. We present promising results on a case study of grasp classification in a controlled setting and discuss future developments.

Key words: multi-modality, visual and sensorimotor patterns, regression theory, behavioural model, object and action recognition
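
To make the idea of a regression-based mapping between modalities concrete, the following is a minimal sketch rather than the architecture proposed in the paper. It assumes a linear ridge regression from visual feature vectors to motor feature vectors. The function names (`fit_ridge_map`, `predict_motor`), the feature dimensions, the regularization weight `lam`, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_ridge_map(X_visual, Y_motor, lam=1e-2):
    """Fit W minimizing ||X W - Y||^2 + lam * ||W||^2 (ridge regression).

    X_visual: (n_samples, d_visual) visual features (hypothetical).
    Y_motor:  (n_samples, d_motor) motor/sensor features (hypothetical).
    """
    d = X_visual.shape[1]
    # Regularized normal equations: (X^T X + lam I) W = X^T Y
    A = X_visual.T @ X_visual + lam * np.eye(d)
    return np.linalg.solve(A, X_visual.T @ Y_motor)


def predict_motor(X_visual, W):
    """Estimate motor features from visual features alone."""
    return X_visual @ W


# Synthetic stand-in data: NOT the grasp data used in the paper.
rng = np.random.default_rng(0)
n, d_vis, d_mot = 200, 8, 3
W_true = rng.normal(size=(d_vis, d_mot))
X = rng.normal(size=(n, d_vis))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_mot))

# Fit on the first 150 paired samples, evaluate on the remaining 50.
W = fit_ridge_map(X[:150], Y[:150])
Y_hat = predict_motor(X[150:], W)
rel_err = np.linalg.norm(Y_hat - Y[150:]) / np.linalg.norm(Y[150:])
print(f"relative error on held-out samples: {rel_err:.4f}")
```

In a setting like the one described, the estimated motor features could be concatenated with the visual features before classification, so that a system observing only images still benefits from the motor modality seen during training. Whether a linear map is adequate depends on the data; a kernel or other nonlinear regressor could be substituted without changing the overall structure.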