We describe a system that learns an object template from a video stream, then localizes and tracks the corresponding object in live video. The template is decomposed into a number of local descriptors, enabling detection and tracking despite partial occlusion. Each local descriptor aggregates contrast-invariant statistics (normalized intensity and gradient orientation) across scales, in a way that enables matching under significant scale variation. Low-level tracking during the training video sequence captures object-specific variability due to the shape of the object, which is encapsulated in the descriptor. Salient locations on both the template and the target image serve as hypotheses to expedite matching.
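To make the descriptor construction concrete, the sketch below shows one plausible form of a local descriptor that aggregates contrast-invariant statistics across scales: intensity is normalized per patch and gradient orientations are histogrammed, then the per-scale histograms are averaged. This is a minimal illustration under assumed parameters (patch radius, scale set, bin count); it is not the paper's exact descriptor.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def multiscale_descriptor(image, x, y, radius=8, scales=(1.0, 2.0, 4.0), n_bins=8):
    """Illustrative multi-scale local descriptor (an assumed construction,
    not the paper's exact one).

    At each scale, extract a patch around (x, y), normalize its intensity
    (zero mean, unit variance, hence affine contrast invariance), and
    histogram gradient orientations weighted by gradient magnitude.
    Histograms from all scales are averaged so a single descriptor can be
    matched under scale changes.
    """
    histograms = []
    for sigma in scales:
        smoothed = gaussian_filter(image.astype(np.float64), sigma)
        r = int(round(radius * sigma))
        patch = smoothed[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1]
        # Contrast normalization: zero mean, unit variance.
        patch = (patch - patch.mean()) / (patch.std() + 1e-8)
        # Gradients along rows (gy) and columns (gx) of the patch.
        gy, gx = np.gradient(patch)
        mag = np.hypot(gx, gy)
        ori = np.arctan2(gy, gx)  # orientation in [-pi, pi]
        hist, _ = np.histogram(ori, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
        hist /= hist.sum() + 1e-8
        histograms.append(hist)
    # Aggregate across scales into one descriptor.
    return np.mean(histograms, axis=0)
```

Averaging per-scale histograms is only one simple aggregation choice; the key property it illustrates is that the resulting descriptor changes little when the object is observed at a different scale.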