Although gesture recognition has been studied extensively, the communicative, affective, and biometric "utility" of natural gesticulation remains relatively unexplored. One of the main reasons is the modeling complexity of spontaneous gestures. While lexical information in speech provides additional cues for disambiguating gestures, it does not cover the rich paralinguistic domain. This paper presents initial findings, drawn from a large corpus of natural monologues, on the prosodic structuring between frequent beat-like strokes and concurrent speech. Using a set of audio-visual features in an HMM-based formulation, we improve the discrimination between visually similar articulatory strokes that serve different communicative functions. The analysis is based on the temporal alignment of detected vocal perturbations with the concurrent hand movement. As a supplementary result, we show that recognized articulatory strokes may be used for quantifying gesturi...
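To make the HMM-based formulation concrete, the sketch below shows one common way such a classifier can be set up: one Gaussian HMM is trained per stroke class on sequences of per-frame audio-visual features, and a new sequence is assigned to the class whose model yields the highest log-likelihood. This is a minimal illustration under stated assumptions, not the paper's implementation; the feature layout, class labels, model sizes, and the use of the hmmlearn library are all hypothetical.

```python
# Minimal per-class HMM classification sketch for gesture strokes.
# Assumptions (not from the paper): hmmlearn as the HMM library,
# 4-dimensional audio-visual feature frames (e.g. hand velocity plus
# pitch/energy measures), and the class labels "beat" / "deixis".
import numpy as np
from hmmlearn import hmm

def train_class_hmm(sequences, n_states=3):
    """Fit one Gaussian HMM on all training sequences of a class."""
    X = np.vstack(sequences)               # (total_frames, n_features)
    lengths = [len(s) for s in sequences]  # frames per sequence
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify(models, sequence):
    """Assign the class whose HMM gives the highest log-likelihood."""
    scores = {label: m.score(sequence) for label, m in models.items()}
    return max(scores, key=scores.get)

# Toy data standing in for aligned audio-visual feature sequences.
rng = np.random.default_rng(0)
train = {
    "beat":   [rng.normal(0.0, 1.0, size=(40, 4)) for _ in range(20)],
    "deixis": [rng.normal(1.5, 1.0, size=(40, 4)) for _ in range(20)],
}
models = {label: train_class_hmm(seqs) for label, seqs in train.items()}
print(classify(models, rng.normal(1.5, 1.0, size=(40, 4))))  # likely "deixis"
```

In this setup, the temporal alignment of vocal perturbations with hand movement would enter through the feature vectors themselves, i.e. each frame would combine prosodic and kinematic measurements over a common time base.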