Robust, real-time tracking of objects from visual data requires probabilistic fusion of multiple visual cues. Previous approaches have either been ad hoc or relied on a Bayesian network with discrete spatial variables, which suffers from discretisation and computational complexity problems. We present a new Bayesian modality fusion network that uses continuous domain variables. The network architecture distinguishes cues that are necessary for the object's presence from those that are not. Computationally expensive and inexpensive modalities are also handled differently to minimise cost. The result is a formal, tractable and robust probabilistic framework for simultaneously tracking multiple objects. While instantaneous inference is exact, approximation is required for propagation over time.
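To illustrate why continuous domain variables admit exact instantaneous inference, the following is a minimal sketch (not the paper's network) of fusing independent Gaussian cue estimates of a continuous state, such as an object's image position. The cue names and numbers are hypothetical; with linear-Gaussian likelihoods, the posterior is obtained in closed form by precision-weighted averaging, with no spatial discretisation.

```python
import numpy as np

def fuse_gaussian_cues(means, variances):
    """Exact posterior for a continuous state observed by independent
    Gaussian cues: combine by precision (inverse-variance) weighting."""
    precisions = 1.0 / np.asarray(variances, dtype=float)
    fused_var = 1.0 / precisions.sum()
    fused_mean = fused_var * (precisions * np.asarray(means, dtype=float)).sum()
    return fused_mean, fused_var

# Hypothetical cues: a noisy colour cue puts x near 100 (variance 16),
# a sharper motion cue puts x near 104 (variance 4).
mean, var = fuse_gaussian_cues([100.0, 104.0], [16.0, 4.0])
# The fused estimate lies closer to the more precise cue, and its
# variance is smaller than either cue's alone.
```

The sharper motion cue dominates the fused estimate, and the fused variance shrinks below both inputs, which is the qualitative behaviour any sound probabilistic cue-fusion scheme should exhibit.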