We present an algorithm that improves trajectory estimation in networks of non-overlapping cameras using audio measurements. The algorithm fuses audio-visual cues within each camera's field of view and recovers trajectories in unobserved regions using microphones only. Audio source localization is performed with a Stereo Audio and Cycloptic Vision (STAC) sensor by estimating the time difference of arrival (TDOA) between the microphone pair from the peak of their cross-correlation. The audio estimates are then smoothed with a Kalman filter, and audio-visual fusion is performed using a dynamic weighting strategy. We show that a multi-modal sensor combining a narrow visual field of view with a wider audio field of view enables extended target tracking in non-overlapping camera settings. In particular, the weighting scheme improves performance in the regions where the audio and visual fields of view overlap. The algorithm is evaluated in several multi-sensor configurations using synthetic data and compared with state-of-the-art algorithms.
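To make the audio localization step concrete, the following is a minimal sketch, not the authors' STAC implementation, of how a TDOA can be estimated from the peak of the cross-correlation between a microphone pair and converted to a source bearing. The sampling rate, microphone spacing, far-field assumption, and all function names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def estimate_tdoa(sig_left: np.ndarray, sig_right: np.ndarray, fs: float) -> float:
    """Estimate the time difference of arrival (seconds) of the left signal relative to the right."""
    # Full cross-correlation; the lag of its maximum gives the relative delay in samples.
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)
    return lag / fs

def tdoa_to_bearing(tdoa: float, mic_spacing: float, speed_of_sound: float = 343.0) -> float:
    """Convert a TDOA to a bearing angle (radians) for a far-field source and a two-microphone array."""
    # Clip to [-1, 1] to guard against delays slightly exceeding the physical maximum.
    arg = np.clip(tdoa * speed_of_sound / mic_spacing, -1.0, 1.0)
    return float(np.arcsin(arg))

if __name__ == "__main__":
    fs = 16_000.0        # assumed sampling rate (Hz)
    mic_spacing = 0.2    # assumed microphone spacing (m)
    rng = np.random.default_rng(0)
    src = rng.standard_normal(4096)
    left = src
    right = np.roll(src, 5)   # right channel receives the source 5 samples later than the left
    tdoa = estimate_tdoa(left, right, fs)
    bearing_deg = np.degrees(tdoa_to_bearing(tdoa, mic_spacing))
    print(f"TDOA: {tdoa * 1e3:.3f} ms, bearing: {bearing_deg:.1f} deg")
```

In a full pipeline along the lines described above, a sequence of such bearing (or position) estimates would be smoothed with a Kalman filter before being fused with the visual detections via the dynamic weighting strategy.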