Automatic emotion recognition has been widely studied and applied to various computer vision tasks (e.g., health monitoring, driver state surveillance, personalized learning, and security monitoring). As revealed by recent psychological and behavioral research, facial expressions are effective at communicating categorical emotions (e.g., happiness, sadness, and surprise), while bodily expressions may contribute more to the perception of dimensional emotional states (e.g., arousal and valence). In this paper, we propose a semi-feature-level fusion framework that incorporates affective information from both the facial and bodily modalities to draw a more reliable interpretation of users' emotional states in a valence–arousal space. A Genetic Algorithm is also applied to conduct automatic feature optimization. We subsequently propose an ensemble regression model to robustly predict users' continuous affective dimensions in the valence–arousal space. The empirical findings indicate that by co...