This paper explores the use of sequential patterns that alternate between local features and saccading actions to learn robust and compact object representations. The temporal order of each sequence encodes the spatial relations between local features. We treat object recognition as a sequential prediction task. Our method uses a Discriminative Variable Memory Markov (DVMM) model, which precisely captures the underlying characteristics of multiple statistical sources that generate sequential patterns stochastically. By pruning long sequential patterns that provide no further information gain over shorter, discriminative ones, the DVMM model represents multiple objects succinctly. Experimental results show that the DVMM model performs significantly better than several other supervised learning algorithms that use a bag-of-features approach.
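
The pruning idea can be illustrated with a minimal sketch of a generic variable-memory Markov classifier. This is not the DVMM model itself. As assumptions made purely for illustration, it trains one model per object, keeps a longer context only when a count-weighted KL divergence from its one-symbol-shorter suffix exceeds a threshold, and classifies by maximum log-likelihood. The function names, the smoothing constant `alpha`, and the toy sequences (where each character stands in for a quantized local feature or a saccade action) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def count_contexts(sequences, max_order):
    """Count next-symbol occurrences for every context of length 0..max_order."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for k in range(min(max_order, i) + 1):
                counts[tuple(seq[i - k:i])][symbol] += 1
    return counts


def distribution(counter, alphabet, alpha=0.5):
    """Additively smoothed next-symbol distribution (alpha is an assumed constant)."""
    total = sum(counter.values()) + alpha * len(alphabet)
    return {s: (counter[s] + alpha) / total for s in alphabet}


def kl(p, q):
    """Kullback-Leibler divergence between two distributions on the same alphabet."""
    return sum(p[s] * math.log(p[s] / q[s]) for s in p)


def train(sequences, alphabet, max_order, threshold):
    """Keep a context only if it adds information over its shorter suffix.

    Assumed pruning rule for illustration, not the paper's criterion.
    """
    counts = count_contexts(sequences, max_order)
    model = {(): distribution(counts[()], alphabet)}
    for context in list(counts):
        if not context:
            continue
        parent = context[1:]  # same context with the oldest symbol dropped
        p = distribution(counts[context], alphabet)
        q = distribution(counts[parent], alphabet)
        weight = sum(counts[context].values())
        if weight * kl(p, q) > threshold:
            model[context] = p
    return model


def log_likelihood(seq, model, max_order):
    """Score a sequence, predicting each symbol from its longest retained context.

    Every symbol in seq must belong to the training alphabet.
    """
    total = 0.0
    for i, symbol in enumerate(seq):
        for k in range(min(max_order, i), -1, -1):
            context = tuple(seq[i - k:i])
            if context in model:  # the empty context is always retained
                total += math.log(model[context][symbol])
                break
    return total


def classify(seq, models, max_order):
    """Return the label whose model assigns the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(seq, models[label], max_order))


# Toy usage: two "objects" with different sequential structure.
alphabet = "ab"
models = {
    "A": train(["ab" * 10], alphabet, max_order=2, threshold=1.0),
    "B": train(["aabb" * 5], alphabet, max_order=2, threshold=1.0),
}
print(classify("abababab", models, max_order=2))
print({label: len(model) for label, model in models.items()})
```

In this toy setting, object "A" is fully described by first-order contexts, so its second-order contexts are pruned, while object "B" needs second-order contexts to be predictable. This mirrors the variable-memory idea: memory length grows only where it improves prediction.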