One of the most desirable characteristics of an Embodied Conversational Agent (ECA) is the capability of interacting with users in a human-like manner. While listening to a user, an ECA should be able to provide backchannel signals through visual and acoustic modalities. In this work we propose an improvement of our previous system to generate multimodal backchannel signals on visual and acoustic modalities. A perceptual study has been performed to understand how context-free multimodal backchannels are interpreted by users.