We are interested in recovering aspects of the vocal tract’s geometry and dynamics from auditory and visual speech cues. We approach the problem in a statistical framework based on Hidden Markov Models (HMMs) and demonstrate effective estimation of the trajectories followed by certain points of interest in the speech production system. Alternative fusion schemes are investigated to account for asynchrony between the modalities and to allow independent modeling of the dynamics of the involved streams. Visual cues are extracted from the speaker’s face by means of Active Appearance Modeling. We report experiments on the QSMT database, which contains audio, video, and electromagnetic articulography data recorded in parallel. The results show that exploiting both audio and visual modalities in a multistream HMM-based scheme clearly improves performance relative to either audio-only or visual-only estimation.