We propose a multi-modal object tracking algorithm that combines appearance, motion, and audio information in a particle filter. The tracker fuses, at the likelihood level, audio-visual observations captured by a video camera coupled with two microphones. Two video likelihoods are computed, one based on a 3D color histogram appearance model and the other on color change detection, while an audio likelihood provides information about the direction of arrival of a target. The direction of arrival is estimated with a multi-band generalized cross-correlation function enhanced with noise suppression and reverberation filtering that uses the precedence effect. We evaluate the tracker in single-modality and multi-modality configurations and quantify the performance improvement obtained by integrating audio and visual information in the tracking process.
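As an illustration of what likelihood-level fusion in a particle filter can look like, the following minimal sketch weights each particle by the product of three per-cue likelihoods and then resamples. It is not the implementation evaluated here. The product rule assumes the cues are conditionally independent given the state, which is our assumption for the sketch. The Gaussian likelihoods are placeholders standing in for the color histogram, change detection, and direction-of-arrival models, and all function names, the 1D state, and the numeric values are hypothetical.

```python
import math
import random


def gaussian_likelihood(state, observed, sigma):
    """Unnormalized Gaussian likelihood (placeholder for a real cue model)."""
    return math.exp(-0.5 * ((state - observed) / sigma) ** 2)


def fuse_and_resample(particles, likelihoods, rng):
    """Weight particles by the product of per-cue likelihoods, then resample.

    particles:   list of hypothesized states (here: 1D horizontal positions)
    likelihoods: list of functions mapping a state to a likelihood value
    rng:         a random.Random instance, passed in for reproducibility
    """
    weights = []
    for state in particles:
        w = 1.0
        for likelihood in likelihoods:
            # Assumption: cues are conditionally independent given the state.
            w *= likelihood(state)
        weights.append(w)

    total = sum(weights)
    if total == 0.0:
        # All particles have zero likelihood: fall back to uniform weights.
        weights = [1.0 / len(particles)] * len(particles)
    else:
        weights = [w / total for w in weights]

    # Multinomial resampling: draw particles in proportion to their weights.
    resampled = rng.choices(particles, weights=weights, k=len(particles))
    return weights, resampled


# Hypothetical usage: three cues that roughly agree on the target position.
rng = random.Random(0)
particles = [rng.uniform(0.0, 100.0) for _ in range(500)]

cues = [
    lambda x: gaussian_likelihood(x, 62.0, 8.0),   # stand-in: appearance cue
    lambda x: gaussian_likelihood(x, 58.0, 12.0),  # stand-in: change detection cue
    lambda x: gaussian_likelihood(x, 60.0, 20.0),  # stand-in: audio cue (broad)
]

weights, resampled = fuse_and_resample(particles, cues, rng)

# Weighted mean of the particles as the state estimate.
estimate = sum(w * x for w, x in zip(weights, particles))
```

In this sketch the broad audio placeholder contributes little when the visual cues are sharp, but it still shapes the weights when a visual cue becomes uninformative, which is the general motivation for fusing at the likelihood level.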