Automatic recognition of human affective states is still a largely unexplored and challenging topic. Even more issues arise when dealing with variable input quality or aiming for real-time, unconstrained, and person-independent scenarios. In this paper, we explore audio-visual multimodal emotion recognition. We present SAMMI, a framework designed to extract real-time emotion appraisals from non-prototypical, person-independent facial expressions and vocal prosody. Different probabilistic fusion methods are compared and evaluated, together with a novel fusion technique called NNET. Results show that NNET improves the recognition score (CR+) by about 19% and the mean average precision by about 30% with respect to the best unimodal system.
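To make the idea of neural-network-based fusion concrete, the following is a minimal sketch, not the paper's implementation: it assumes a NNET-style late fusion in which each unimodal classifier (facial expression, vocal prosody) outputs an emotion probability vector and a small feed-forward network learns to combine them. The emotion label set, layer sizes, and training setup are illustrative assumptions.

```python
# Hypothetical sketch of NNET-style late fusion (not the authors' code):
# a small MLP combines per-modality emotion probability vectors.
import torch
import torch.nn as nn

EMOTIONS = ["anger", "joy", "sadness", "surprise", "neutral"]  # assumed label set


class FusionNet(nn.Module):
    def __init__(self, n_classes: int = len(EMOTIONS), hidden: int = 16):
        super().__init__()
        # Input: concatenated audio and video class-probability vectors.
        self.net = nn.Sequential(
            nn.Linear(2 * n_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, audio_probs: torch.Tensor, video_probs: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_probs, video_probs], dim=-1)
        return self.net(fused).softmax(dim=-1)  # fused emotion appraisal


if __name__ == "__main__":
    model = FusionNet()
    # Dummy unimodal outputs for a single utterance/frame (batch size 1).
    audio = torch.softmax(torch.randn(1, len(EMOTIONS)), dim=-1)
    video = torch.softmax(torch.randn(1, len(EMOTIONS)), dim=-1)
    print(model(audio, video))
```

In such a scheme the fusion network would be trained on held-out unimodal outputs, letting it learn which modality to trust for which emotion, in contrast to fixed probabilistic combination rules.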