In this paper we explore the link between temporally dense view-based object recognition and sparse image representations based on local keypoints. The temporal component is an add-on that allows us to extract information that is distinctive of a given object within a given viewpoint range. We use temporal descriptions for both training and testing. In the training phase, each image sequence contains a single object observed from different viewpoints. At run time, video shots are analyzed in search of known objects. Training and test video shots are represented by a structure of scale-space keypoints selected for their robustness to viewpoint changes. In the matching phase we emphasize co-occurring keypoints and attenuate the importance of isolated points, both in the model and in the test representation. With our prototype recognition system we obtained very good results in both controlled and unconstrained environments, laying the groundwork for real-world applications such as automatic pl...