In this paper we describe a method for learning hierarchical representations that describe and recognize gestures expressed as one- and two-arm movements, based on competitive learning. At the lowest level of the hierarchy, atomic motions (“letters”) corresponding to flow fields computed from successive color image frames are derived using Learning Vector Quantization (LVQ). At the intermediate level, the atomic motions are clustered into actions (“words”) using homogeneity criteria. At the highest level, actions are combined into activities (“sentences”) using proximity-driven clustering. We demonstrate the feasibility and robustness of our approach on real color-image sequences, each consisting of several hundred frames of dynamic one- and two-arm movements.
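As a rough illustration of the lowest level of the hierarchy, the sketch below shows a minimal LVQ1 quantizer in Python. It is not the paper's implementation: the flow-field feature vectors, class labels, prototype count, learning rate, and schedule are all hypothetical stand-ins, assumed only for the example.

```python
import numpy as np

def train_lvq1(X, y, n_prototypes_per_class=4, lr=0.05, epochs=30, seed=0):
    """Minimal LVQ1: prototypes move toward same-class samples, away from others.

    X: (n_samples, d) flow-field feature vectors (hypothetical features).
    y: integer class labels for the atomic motions ("letters").
    """
    rng = np.random.default_rng(seed)
    protos, proto_labels = [], []
    for c in np.unique(y):
        # Initialize each class's prototypes from random samples of that class.
        idx = rng.choice(np.flatnonzero(y == c), n_prototypes_per_class, replace=True)
        protos.append(X[idx].copy())
        proto_labels.append(np.full(n_prototypes_per_class, c))
    P = np.vstack(protos)
    L = np.concatenate(proto_labels)

    for epoch in range(epochs):
        alpha = lr * (1.0 - epoch / epochs)    # linearly decaying learning rate
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(P - X[i], axis=1)
            w = np.argmin(d)                   # best-matching prototype
            sign = 1.0 if L[w] == y[i] else -1.0
            P[w] += sign * alpha * (X[i] - P[w])
    return P, L

def quantize(X, P, L):
    """Assign each feature vector the label of its nearest prototype ("letter")."""
    d = np.linalg.norm(X[:, None, :] - P[None, :, :], axis=2)
    return L[np.argmin(d, axis=1)]
```

In a pipeline of the kind the abstract describes, the per-frame prototype assignments produced by `quantize` would form the “letter” stream that the higher levels cluster into actions and activities.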