We consider the problem of computing the likelihood of a gesture from regular, unaided video sequences, without relying on perfect segmentation of the scene. Instead of requiring that low- and mid-level processes produce near-perfect segmentation of relevant body parts such as the hands, we take into account that such processes can produce only uncertain information: in practice, the hands are detected as fragmented regions intermixed with clutter. To address this problem, we propose an extension of the HMM formalism, which we call the frag-HMM, that allows reasoning over fragmented observations via an intermediate grouping process. In this formulation, we do not match the frag-HMM to a single observation sequence, but rather to a sequence of observation sets, where each observation set is a collection of groups of fragmented observations. Based on the developed model, we show how to perform three kinds of computations. The first one is to decide on the best observation group for each fra...
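To make the notion of matching against a sequence of observation sets concrete, the following is a minimal sketch, assuming each candidate group in a frame has already been reduced to a feature vector and that decoding proceeds in a Viterbi-like fashion, choosing a group per frame jointly with the state path. The function name `frag_hmm_decode`, the `emission` callable, and the greedy per-frame group choice are illustrative assumptions, not the formulation developed here.

```python
import numpy as np

def frag_hmm_decode(obs_sets, pi, A, emission):
    """Hypothetical sketch of frag-HMM decoding over observation sets.

    obs_sets : list of length T; obs_sets[t] is a list of candidate groups
               (e.g. feature vectors pooled from fragmented hand regions)
               detected in frame t.
    pi       : (N,) initial state probabilities.
    A        : (N, N) state transition probabilities.
    emission : emission(state, group) -> likelihood of a group under a state.
    Returns the best log-score and a per-frame list of chosen group indices.
    """
    N, T = len(pi), len(obs_sets)
    delta = None          # delta[s] = best log-score of a path ending in state s
    best_group = []

    for t in range(T):
        groups = obs_sets[t]
        # Emission log-likelihoods for every (state, group) pair in this frame.
        logB = np.array([[np.log(emission(s, g) + 1e-300) for g in groups]
                         for s in range(N)])
        if t == 0:
            scores = np.log(pi)[:, None] + logB               # (N, G)
        else:
            trans = delta[:, None] + np.log(A)                # (prev, cur)
            scores = trans.max(axis=0)[:, None] + logB        # (N, G)
        # Keep, per state, the best-scoring group; record a greedy per-frame
        # group choice (a full decoder would keep back-pointers instead).
        delta = scores.max(axis=1)
        best_group.append(int(scores.max(axis=0).argmax()))

    return float(delta.max()), best_group
```

Replacing the per-frame maximization over groups with a summation would yield a marginal likelihood over groupings rather than a single best grouping; the max form above simply mirrors the idea of selecting, for each frame, the observation group that best explains the gesture model.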