In this paper, we propose a unified graphical-model framework for interpreting a scene composed of multiple objects in monocular video sequences. A single pairwise Markov random field (MRF) jointly models all observed and hidden variables of interest: image intensities, pixel states (the index of the associated object and relative depth), and object states (model motion parameters and relative depth). Particular attention is given to occlusion handling, which is addressed by introducing rigorous visibility modeling into the MRF formulation. By minimizing the MRF energy, we simultaneously segment the objects, track them, and sort them by depth. Promising experimental results demonstrate the potential of this framework and its robustness to image noise, cluttered backgrounds, camera and background motion, and even complete occlusions.
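
For orientation, the energy of a generic pairwise MRF can be written as a sum of unary and pairwise potentials. The form below is the standard textbook expression, not the specific energy of the proposed framework; the symbols $\mathcal{V}$, $\mathcal{E}$, $\phi_i$, and $\psi_{ij}$ are notation assumed here for illustration only.

```latex
% Generic pairwise MRF energy (illustrative notation, assumed here):
%   V       : set of nodes (e.g. pixel and object variables)
%   E       : set of edges between interacting nodes
%   phi_i   : unary potential of node i
%   psi_ij  : pairwise potential of edge (i, j)
\[
  E(\mathbf{x}) \;=\; \sum_{i \in \mathcal{V}} \phi_i(x_i)
  \;+\; \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(x_i, x_j)
\]
% Inference seeks the joint configuration of minimum energy:
\[
  \hat{\mathbf{x}} \;=\; \arg\min_{\mathbf{x}} \, E(\mathbf{x})
\]
```

In this reading, segmentation, tracking, and depth ordering would all correspond to components of the single minimizer $\hat{\mathbf{x}}$, which is what allows them to be estimated simultaneously.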