This paper explores the relationship between discourse structure and coverbal gesture. Using the idea of gestural cohesion, we show that coherent topic segments are characterized by homogeneous gestural forms, and that changes in the distribution of gestural features predict segment boundaries. Gestural features are extracted automatically from video and combined with lexical features in a hierarchical Bayesian model. Unsupervised inference is performed through Metropolis-Hastings sampling. The resulting multimodal system outperforms a verbal-only model, with both manual and automatically recognized speech transcripts.
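The inference step named above is standard Metropolis-Hastings sampling. As a point of reference for readers unfamiliar with the technique, here is a minimal generic sketch in Python: it is an illustration of the general algorithm on a toy one-dimensional target, not the paper's actual segmentation model, and the function names and parameters are hypothetical.

```python
import math
import random

def metropolis_hastings(log_prob, x0, n_samples, step=1.0, seed=0):
    """Generic Metropolis-Hastings sampler with a symmetric Gaussian
    random-walk proposal.

    log_prob: unnormalized log-density of the target distribution.
    Because the proposal is symmetric, the acceptance ratio reduces to
    the ratio of target densities.
    """
    rng = random.Random(seed)
    x, lp = x0, log_prob(x0)
    samples = []
    for _ in range(n_samples):
        x_new = x + rng.gauss(0.0, step)      # propose a local move
        lp_new = log_prob(x_new)
        # Accept with probability min(1, p(x_new) / p(x)).
        if math.log(rng.random()) < lp_new - lp:
            x, lp = x_new, lp_new
        samples.append(x)                      # rejected moves repeat x
    return samples

# Toy example: sample from a standard normal via its unnormalized
# log-density -x^2/2; the sample mean should be near 0.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=5000)
mean = sum(samples) / len(samples)
```

In the paper's setting, the sampled state would instead be a discrete segmentation, with the model's posterior supplying the (unnormalized) target density.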