Tracking body poses of multiple persons in monocular video is a challenging problem due to the high dimensionality of the state space and issues such as inter-occlusion of the persons' bodies. We proposed a three-stage approach with a multi-level state representation that enables a hierarchical estimation of 3D body poses. At the first stage, humans are tracked as blobs. In the second stage, parts such as face, shoulders and limbs are estimated and estimates are combined by grid-based belief propagation to infer 2D joint positions. The derived belief maps are used as proposal functions in the third stage to infer the 3D pose using data-driven Markov chain Monte Carlo. Experimental results on realistic indoor video sequences show that the method is able to track multiple persons during complex movement such as turning movement with interocclusion.