Untethered multimodal interfaces are more attractive than tethered ones because they allow more natural and expressive interaction. Such interfaces usually require robust vision-based body pose estimation and gesture recognition. In interfaces where a user interacts with a computer through speech and arm gestures, the user's spoken keywords can be recognized in conjunction with hypotheses of body poses. This co-occurrence can reduce the number of body pose hypotheses the vision-based tracker must consider. In this paper we show that incorporating speech-based body pose constraints can increase the robustness and accuracy of vision-based tracking systems. Next, we describe an approach for gesture recognition. We show how Linear Discriminant Analysis (LDA) can be employed to estimate 'good features' for use in a standard HMM-based gesture recognition system. We show that, by applying our LDA scheme, recognition errors can be significantly reduced relative to a standard HMM-based technique.
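To make the LDA-then-HMM pipeline concrete, the following is a minimal sketch of the general idea, not the paper's actual implementation: pose features are projected into a discriminative subspace with LDA, and one HMM per gesture class is fit on the projected sequences. The use of scikit-learn and hmmlearn, and all function and parameter names below, are illustrative assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from hmmlearn.hmm import GaussianHMM

def fit_lda_hmms(sequences, frame_labels, n_classes, n_dims=3, n_states=4):
    """Project pose features with LDA, then fit one HMM per gesture class.

    sequences: list of (T, D) arrays of per-frame pose features (hypothetical data).
    frame_labels: matching list of (T,) arrays of gesture-class labels.
    Note: LDA requires n_dims <= min(n_classes - 1, D).
    """
    X = np.vstack(sequences)          # stack all frames: (sum of T, D)
    y = np.concatenate(frame_labels)  # per-frame class labels
    lda = LinearDiscriminantAnalysis(n_components=n_dims).fit(X, y)

    hmms = {}
    for c in range(n_classes):
        # Gather the LDA-projected frames of every sequence belonging to class c.
        seqs = [lda.transform(s) for s, l in zip(sequences, frame_labels)
                if l[0] == c]
        lengths = [len(s) for s in seqs]
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag")
        hmm.fit(np.vstack(seqs), lengths)
        hmms[c] = hmm
    return lda, hmms

def classify(sequence, lda, hmms):
    """Label a new (T, D) sequence by the class whose HMM scores it highest."""
    z = lda.transform(sequence)
    return max(hmms, key=lambda c: hmms[c].score(z))
```

At recognition time, each candidate gesture sequence is projected through the learned LDA transform and scored against every class-conditional HMM; the class with the highest log-likelihood wins. The point of the LDA step, as the abstract argues, is that the projected features separate the gesture classes better than the raw pose features, which is what reduces HMM recognition errors.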