Existing action recognition approaches rely mainly on the discriminative power of individual local descriptors extracted at spatio-temporal interest points (STIPs), while the geometric relationships among the local features are ignored. This paper presents new features, called pairwise features (PWF), which encode both the appearance and the spatio-temporal relations of the local features for action recognition. First, STIPs are extracted; then PWFs are constructed by grouping pairs of STIPs that are close in both space and time. We propose a combination of two codebooks for video representation. Experiments on two standard human action datasets, the KTH dataset and the Weizmann dataset, show that the proposed approach outperforms most existing methods.

Key-words: action recognition, local features, pairwise features.
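
To illustrate the grouping step described in the abstract, here is a minimal sketch of pairing STIPs that lie within both a spatial and a temporal radius. The `STIP` container, the `spatial_radius` and `temporal_radius` parameters, and the descriptor field are illustrative assumptions, not the paper's actual thresholds or descriptors.

```python
# Minimal sketch of grouping STIP pairs that are close in both space and time.
# All names and thresholds here are illustrative assumptions.
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

import numpy as np


@dataclass
class STIP:
    x: float                # spatial position (pixels)
    y: float
    t: float                # frame index
    descriptor: np.ndarray  # local appearance/motion descriptor (e.g. HOG/HOF)


def build_pairwise_features(stips: List[STIP],
                            spatial_radius: float = 30.0,
                            temporal_radius: float = 10.0) -> List[Tuple[STIP, STIP]]:
    """Return pairs of STIPs that are close in both space and time."""
    pairs = []
    for a, b in combinations(stips, 2):
        spatial_dist = np.hypot(a.x - b.x, a.y - b.y)
        temporal_dist = abs(a.t - b.t)
        if spatial_dist <= spatial_radius and temporal_dist <= temporal_radius:
            # Each retained pair keeps both appearance descriptors together
            # with their relative spatio-temporal offset implicitly available.
            pairs.append((a, b))
    return pairs
```

Each retained pair can then be encoded (appearance of both points plus their relative offset) and quantized, while individual STIPs feed a second codebook, giving the two-codebook video representation mentioned above.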