This paper addresses the interpretation of a user's intention from a video camera and a speech recognizer. In contrast to previous work that relies on gesture recognition alone, we show that adding speech improves the system's comprehension. For gesture recognition, the user wears a colored glove, from which we extract the velocity of the hand's center of gravity. A Hidden Markov Model (HMM) is trained for each gesture to be recognized. To decide whether a gesture has actually been performed during a dynamic action, we introduce a threshold model: a gesture is detected only if its HMM score exceeds that of the threshold model. Offline tests yield a recognition rate above 85% for every gesture. Speech and gestures are combined using Bayes' theorem.
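To make the two decision steps concrete, the following is a minimal Python sketch of (a) the threshold-model rejection rule and (b) a Bayes-rule fusion of the two modalities. All function and variable names are our own illustration, the per-gesture HMM log-likelihoods are assumed to be precomputed, and the fusion assumes the modalities are conditionally independent given the intent; none of these details are fixed by the abstract itself.

```python
import numpy as np

def detect_gesture(loglik_per_gesture, loglik_threshold_model):
    """Threshold-model rejection: return the best-scoring gesture,
    or None when even the best gesture HMM does not beat the
    threshold model (i.e. no gesture was performed)."""
    best = max(loglik_per_gesture, key=loglik_per_gesture.get)
    if loglik_per_gesture[best] <= loglik_threshold_model:
        return None
    return best

def fuse_modalities(prior, p_gesture_given_intent, p_speech_given_intent):
    """Bayes-rule fusion of gesture and speech evidence, assuming
    conditional independence of the two modalities given the intent:
    P(intent | g, s) is proportional to
    P(intent) * P(g | intent) * P(s | intent)."""
    posterior = prior * p_gesture_given_intent * p_speech_given_intent
    return posterior / posterior.sum()

# Hypothetical numbers for illustration only.
logliks = {"wave": -12.3, "point": -20.1}
print(detect_gesture(logliks, loglik_threshold_model=-15.0))  # -> "wave"

prior = np.array([0.4, 0.3, 0.3])  # prior over three intents
p_g = np.array([0.7, 0.2, 0.1])    # gesture likelihood per intent
p_s = np.array([0.5, 0.4, 0.1])    # speech likelihood per intent
print(fuse_modalities(prior, p_g, p_s))  # posterior over intents
```

The rejection rule mirrors the role of the threshold model described above: rather than a fixed likelihood cutoff, the threshold model provides a data-dependent baseline that a gesture HMM must outscore.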