Video provides not only rich visual cues such as motion and appearance, but also much less explored long-range temporal interactions among objects. We aim to capture such interactions and to construct a powerful intermediatelevel video representation for subsequent recognition. Motivated by this goal, we seek to obtain spatio-temporal oversegmentation of a video into regions that respect object boundaries and, at the same time, associate object pixels over many video frames. The contributions of this paper are two-fold. First, we develop an efficient spatiotemporal video segmentation algorithm, which naturally incorporates long-range motion cues from the past and future frames in the form of clusters of point tracks with coherent motion. Second, we devise a new track clustering cost function that includes occlusion reasoning, in the form of depth ordering constraints, as well as motion similarity along the tracks. We evaluate the proposed approach on a challenging set of video sequen...