In this paper we propose and test an action recognition algorithm in which images of the scene captured by multiple cameras are first used to generate a volumetric representation of a moving human body, in terms of voxel sets (voxsets), by means of volumetric intersection. The recognition stage is then performed directly on the 3D data, allowing the system to avoid critical problems such as viewpoint dependence and motion trajectory variability. Suitable features are extracted from the voxset representing the body and fed to a classical hidden Markov model, which produces a finite-state description of the motion.
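The core of the volumetric intersection step is the visual-hull idea: a voxel is kept only if it projects inside the silhouette seen by every camera. The sketch below is a minimal illustration of that principle, not the paper's implementation; it assumes idealized orthographic cameras (each represented by a hypothetical projection function mapping voxel coordinates to pixel coordinates) and precomputed binary silhouette masks.

```python
import numpy as np

def volumetric_intersection(silhouettes, projections, coords):
    """Visual-hull sketch: a voxel is occupied only if it projects
    inside the silhouette mask of every camera.

    silhouettes -- list of 2D boolean arrays (one per camera)
    projections -- list of functions mapping (N, 3) voxel coords
                   to (N, 2) integer pixel coords (assumed ideal)
    coords      -- (N, 3) integer array of voxel center coordinates
    """
    occupancy = np.ones(len(coords), dtype=bool)
    for mask, project in zip(silhouettes, projections):
        uv = project(coords)
        # Voxels projecting outside the image are treated as not seen.
        inside = ((uv[:, 0] >= 0) & (uv[:, 0] < mask.shape[0]) &
                  (uv[:, 1] >= 0) & (uv[:, 1] < mask.shape[1]))
        hit = np.zeros(len(coords), dtype=bool)
        hit[inside] = mask[uv[inside, 0], uv[inside, 1]]
        occupancy &= hit  # intersect with this camera's cone
    return occupancy

# Toy example: an 8x8x8 grid observed by two orthographic cameras,
# one looking along the z axis, one along the x axis.
grid = np.stack(np.meshgrid(range(8), range(8), range(8),
                            indexing="ij"), axis=-1).reshape(-1, 3)

mask_xy = np.zeros((8, 8), dtype=bool)  # silhouette in the (x, y) plane
mask_xy[2:6, 2:6] = True
mask_yz = np.zeros((8, 8), dtype=bool)  # silhouette in the (y, z) plane
mask_yz[2:6, 2:6] = True

occ = volumetric_intersection(
    [mask_xy, mask_yz],
    [lambda c: c[:, [0, 1]], lambda c: c[:, [1, 2]]],
    grid,
)
# The intersection of the two silhouette cones is the 4x4x4 block
# with x, y, z all in [2, 6), i.e. 64 occupied voxels.
print(int(occ.sum()))
```

In a real multi-camera setup the projection functions would come from calibrated perspective camera matrices, and the resulting voxset is what the feature extraction and HMM stages operate on.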