We describe a system for automatically extracting the dynamics of tongue gestures from ultrasound images of the tongue using translational deep belief networks (tDBNs). In a tDBN, a joint model of the input and output vectors is learned during a generative pretraining stage, and a translation step then transforms input-only vectors into this joint representation. A final fine-tuning stage reconstructs the desired outputs given the input vectors. We show that this technique dramatically improves performance on segmenting ultrasound image sequences of continuous speech into individual consonant gestures, compared with the original DBN method of [1] as well as with alternative methods using PCA and support vector machines.
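To make the joint-pretraining and translation steps concrete, the following is a minimal sketch, not the paper's implementation: it uses a single Bernoulli RBM in place of a full multi-layer DBN, trains it with CD-1 on concatenated [input; output] vectors, and performs the translation step by clamping the input units and Gibbs-sampling the output units. The fine-tuning stage is omitted, and all names, dimensions, and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class JointRBM:
    """Bernoulli RBM over concatenated [input; output] vectors.

    Single-layer stand-in for a tDBN's generative joint model;
    a real DBN would stack several such layers.
    """
    def __init__(self, n_in, n_out, n_hid, lr=0.05):
        self.n_in, self.n_out = n_in, n_out
        n_vis = n_in + n_out
        self.W = 0.01 * rng.standard_normal((n_vis, n_hid))
        self.b_vis = np.zeros(n_vis)
        self.b_hid = np.zeros(n_hid)
        self.lr = lr

    def train_step(self, v0):
        # One step of contrastive divergence (CD-1) on a minibatch
        # of joint visible vectors v0 = [x; y].
        h0 = sigmoid(v0 @ self.W + self.b_hid)
        h0_s = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0_s @ self.W.T + self.b_vis)
        h1 = sigmoid(v1 @ self.W + self.b_hid)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_vis += self.lr * (v0 - v1).mean(axis=0)
        self.b_hid += self.lr * (h0 - h1).mean(axis=0)

    def translate(self, x, n_gibbs=20):
        # "Translation": clamp the input units to x, initialize the
        # unknown output units at 0.5, and run alternating Gibbs
        # updates so the joint model fills in the missing outputs.
        v = np.concatenate(
            [x, 0.5 * np.ones((x.shape[0], self.n_out))], axis=1)
        for _ in range(n_gibbs):
            h = sigmoid(v @ self.W + self.b_hid)
            v_new = sigmoid(h @ self.W.T + self.b_vis)
            v[:, self.n_in:] = v_new[:, self.n_in:]  # inputs stay clamped
        return v[:, self.n_in:]  # inferred output vector

# Toy usage on synthetic data (outputs are a function of the inputs).
X = (rng.random((500, 8)) > 0.5).astype(float)
Y = X[:, :4] * X[:, 4:]  # hypothetical "labels" for illustration
rbm = JointRBM(n_in=8, n_out=4, n_hid=64)
for epoch in range(200):
    rbm.train_step(np.concatenate([X, Y], axis=1))
print(np.round(rbm.translate(X[:5]), 2))
```

In the full method, the outputs inferred by this translation step would then serve as the starting point for a discriminative fine-tuning stage that learns to reconstruct the desired outputs directly from inputs.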