This paper presents a framework for recognising realistic human actions captured in unconstrained environments. The novelty of this work lies in three aspects. First, we propose a new action representation based on computing a rich set of descriptors from key point trajectories. Second, to cope with drastic changes in motion characteristics with and without camera movement, we develop an adaptive feature fusion method that combines different local motion descriptors, improving model robustness against feature noise and background clutter. Finally, we propose a novel Multi-Class Delta Latent Dirichlet Allocation model for feature selection, in which the most informative features in a high-dimensional feature space are selected collaboratively, rather than independently as in existing feature selection methods. Extensive experiments on challenging public datasets demonstrate the effectiveness of the proposed framework.