In this paper we propose a new method for tracking the bitmap (silhouette) of an object in a video under very general conditions. We assume a general, possibly non-rigid target with no prior information other than its initialization. Both the target and the background may change their appearance over time, and the camera may move arbitrarily. The proposed algorithm fuses different visual cues by means of a conditional random field. The target's bitmap is estimated in every frame by incorporating temporal color similarity, spatial color continuity and spatial motion continuity into an energy function that is minimized via min-cut. Spatial motion continuity is incorporated into the energy function at multiple image resolutions by a novel multi-scale energy term. Experiments demonstrate the robustness of our method and its advantage over other algorithms.
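To illustrate what "minimized via min-cut" means for a binary target/background labeling, the following is a minimal, generic sketch, not the method proposed here. It assumes a simple energy made of per-pixel unary costs plus a Potts smoothness penalty between 4-neighbours, which stands in for the paper's temporal color, spatial color and multi-scale motion terms. The function name `min_cut_segment`, its cost arrays and the `smooth` weight are hypothetical. Max-flow is computed with plain Edmonds-Karp for readability; the pixels left on the source side of the minimum cut form the estimated bitmap.

```python
from collections import deque


def min_cut_segment(fg_cost, bg_cost, smooth):
    """Hypothetical sketch: binary labeling by s-t min-cut.

    fg_cost[r][c]: cost of labeling pixel (r, c) as target.
    bg_cost[r][c]: cost of labeling pixel (r, c) as background.
    smooth: Potts penalty paid for each pair of 4-neighbours with different labels.
    Returns a bitmap (list of lists) with 1 = target, 0 = background.
    """
    rows, cols = len(fg_cost), len(fg_cost[0])
    n = rows * cols
    S, T = n, n + 1  # source = target side, sink = background side
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # reverse residual edge

    for r in range(rows):
        for c in range(cols):
            p = r * cols + c
            # Cutting S->p (p ends on the sink side) pays the background cost;
            # cutting p->T (p stays on the source side) pays the target cost.
            add_edge(S, p, bg_cost[r][c])
            add_edge(p, T, fg_cost[r][c])
            # Smoothness edges in both directions between 4-neighbours.
            if c + 1 < cols:
                add_edge(p, p + 1, smooth)
                add_edge(p + 1, p, smooth)
            if r + 1 < rows:
                add_edge(p, p + cols, smooth)
                add_edge(p + cols, p, smooth)

    # Edmonds-Karp: repeatedly augment along shortest residual paths.
    while True:
        parent = {S: None}
        queue = deque([S])
        while queue and T not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if T not in parent:
            break  # parent now holds every node reachable from S
        flow = float("inf")
        v = T
        while parent[v] is not None:
            u = parent[v]
            flow = min(flow, cap[u][v])
            v = u
        v = T
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Pixels still reachable from the source lie on the target side of the cut.
    return [[1 if r * cols + c in parent else 0 for c in range(cols)]
            for r in range(rows)]


print(min_cut_segment([[0, 0, 9, 9]], [[9, 9, 0, 0]], smooth=1))
```

In the usage line, the two left pixels are cheap to label as target and the two right pixels are cheap to label as background, so the cut separates them at a cost of one smoothness penalty. A large `smooth` weight instead absorbs an isolated pixel whose unary cost disagrees with its neighbours, which is the kind of spatial continuity an energy of this form encodes.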