This paper addresses gesture recognition under small sample size, where direct use of traditional classifiers is difficult due to the high dimensionality of the input space. We propose a pairwise feature extraction method for classifying video volumes. Canonical Correlation Analysis is combined with discriminant functions and the Scale-Invariant Feature Transform (SIFT) to obtain discriminative spatiotemporal features for robust gesture recognition. The proposed method is practically favorable because it works well with a small number of training samples, involves few parameters, and is computationally efficient. In experiments using 900 videos of 9 hand gesture classes, the proposed method notably outperformed classifiers such as the Support Vector Machine and the Relevance Vector Machine, achieving 85% accuracy.