One of the first steps towards understanding natural multimodal language is aligning gesture and speech, so that the appropriate gestures ground referential pronouns in the speech. This paper presents a novel technique for gesture-speech alignment, inspired by salience-based approaches to anaphoric pronoun resolution. We use a hybrid of data-driven and knowledge-based methods: the basic structure is derived from a set of rules about gesture salience, but the salience weights themselves are learned from a corpus. Our system achieves 95% recall and precision on a corpus of transcriptions of unconstrained multimodal monologues, significantly outperforming a competitive baseline.
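
As a rough illustration of the salience-based alignment idea summarized above, the sketch below scores each candidate gesture for a pronoun as a weighted combination of salience features and selects the highest-scoring one. The feature names (temporal overlap, temporal distance, deixis) and the hand-set weights are hypothetical placeholders, not the paper's actual model; in the approach described here, the weights would be learned from an annotated corpus.

```python
# Minimal sketch of salience-based gesture-speech alignment.
# Feature set and weights are illustrative assumptions, not the paper's model.

from dataclasses import dataclass


@dataclass
class Gesture:
    start: float      # gesture onset time (seconds)
    end: float        # gesture offset time (seconds)
    is_deictic: bool  # e.g., a pointing gesture


def salience_features(gesture: Gesture, pronoun_time: float) -> list[float]:
    """Feature vector describing how salient a gesture is for a pronoun
    uttered at `pronoun_time`."""
    overlap = 1.0 if gesture.start <= pronoun_time <= gesture.end else 0.0
    # Temporal distance from the pronoun to the nearest edge of the gesture.
    distance = 0.0 if overlap else min(abs(pronoun_time - gesture.start),
                                       abs(pronoun_time - gesture.end))
    return [overlap, -distance, 1.0 if gesture.is_deictic else 0.0]


def align(pronoun_time: float, gestures: list[Gesture],
          weights: list[float]) -> Gesture:
    """Pick the gesture whose weighted salience score is highest."""
    def score(g: Gesture) -> float:
        return sum(w * f for w, f in zip(weights, salience_features(g, pronoun_time)))
    return max(gestures, key=score)


# Hand-set placeholder weights; a corpus-trained model would estimate these.
weights = [2.0, 0.5, 1.0]
gestures = [Gesture(0.0, 1.2, False), Gesture(3.4, 4.1, True)]
print(align(3.8, gestures, weights))  # -> the deictic gesture overlapping the pronoun
```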