This paper describes a novel method for content extraction and scene retrieval in video sequences based on local region descriptors. Local invariant features are extracted from every frame of a sequence and tracked throughout the shot, and only the features that remain stable are retained. The scenes in a shot are then represented by these stable features rather than by features from one or more key frames. Compared with previous key-frame-based approaches, the proposed method is highly robust to camera and object motion and can withstand severe illumination changes. The approach is applied to scene retrieval experiments, in which it demonstrates excellent performance.
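
The idea of keeping only features that persist through a shot can be illustrated with a minimal sketch. It is not the paper's implementation: the greedy nearest-neighbour matching, the Euclidean distance, the averaging of each track, and the names `stable_features`, `max_dist`, and `min_length` are all assumptions made for illustration, and descriptors are taken to be plain tuples of floats that have already been computed for each frame.

```python
from math import dist


def stable_features(frames, max_dist=0.5, min_length=3):
    """Return one averaged descriptor per track that lasts >= min_length frames.

    frames: list of frames, each a list of descriptor tuples (hypothetical input
    format). max_dist and min_length are illustrative thresholds, not values
    taken from the paper.
    """
    finished = []  # tracks that have ended
    active = []    # tracks that were matched in the previous frame

    for descriptors in frames:
        unmatched = list(descriptors)
        next_active = []
        for track in active:
            # Greedy matching: nearest unmatched descriptor to the track's last one.
            best = min(unmatched, key=lambda d: dist(track[-1], d), default=None)
            if best is not None and dist(track[-1], best) <= max_dist:
                unmatched.remove(best)
                track.append(best)
                next_active.append(track)
            else:
                finished.append(track)  # track lost in this frame
        # Every descriptor left unmatched starts a new track.
        next_active.extend([d] for d in unmatched)
        active = next_active

    finished.extend(active)

    stable = []
    for track in finished:
        if len(track) >= min_length:
            n = len(track)
            # Represent the track by the component-wise mean of its descriptors.
            stable.append(tuple(sum(c) / n for c in zip(*track)))
    return stable


if __name__ == "__main__":
    shot = [
        [(0.0, 0.0), (5.0, 5.0)],
        [(0.1, 0.0), (9.0, 9.0)],
        [(0.2, 0.0)],
    ]
    print(stable_features(shot))
```

In this toy shot only the descriptor that drifts slowly from frame to frame survives for three frames, so it is the only one kept; the two descriptors that each appear in a single frame are discarded. A real system would replace the tuples with actual local region descriptors and would likely use a more careful matching strategy.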