In this paper, we propose a novel graph-embedding method for lipreading. To characterize the temporal connections among the video frames of an utterance, a new distance metric is defined on pairs of frames, and graphs are constructed from these inter-frame distances to represent the video dynamics. Audio information is used to assist in computing the distances. For each utterance, a subspace of the visual feature space is learned within a graph-embedding framework from well-defined intrinsic and penalty graphs. The video dynamics are found to be well preserved along some dimensions of this subspace. Discriminative cues are then decoded from the curves traced by the projected visual features to classify different utterances.
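The graph-embedding step summarized above can be sketched as follows. This is a minimal illustrative example of linear graph embedding with an intrinsic and a penalty graph, not the paper's exact construction: the neighbourhood sizes, the heat-kernel weights, the kernel width `sigma`, and the purely visual frame distance (the paper additionally uses audio to refine the distances) are all assumptions made for illustration.

```python
# Illustrative sketch of linear graph-embedding subspace learning.
import numpy as np
from scipy.linalg import eigh

def frame_distances(frames):
    """Pairwise Euclidean distances between per-frame visual feature vectors.
    (The paper also uses audio information to assist these distances.)"""
    diff = frames[:, None, :] - frames[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def graph_embedding(X, k_intrinsic=3, k_penalty=3, sigma=1.0, dim=2):
    """X: (n_frames, n_features) visual features of one utterance.
    Returns an (n_features, dim) projection that preserves the intrinsic
    graph while separating frames connected in the penalty graph."""
    n = X.shape[0]
    D = frame_distances(X)
    order = np.argsort(D, axis=1)

    # Intrinsic graph: heat-kernel weights between close frames;
    # penalty graph: unit weights between the most distant frames.
    W_i = np.zeros((n, n))
    W_p = np.zeros((n, n))
    for i in range(n):
        for j in order[i, 1:k_intrinsic + 1]:        # nearest neighbours (skip self)
            W_i[i, j] = W_i[j, i] = np.exp(-D[i, j] ** 2 / sigma ** 2)
        for j in order[i, -k_penalty:]:              # farthest frames
            W_p[i, j] = W_p[j, i] = 1.0

    # Graph Laplacians L = Deg - W for both graphs.
    L_i = np.diag(W_i.sum(axis=1)) - W_i
    L_p = np.diag(W_p.sum(axis=1)) - W_p

    # Linearized graph embedding: minimize tr(V^T X^T L_i X V) relative to the
    # penalty term, i.e. a generalized eigenvalue problem A v = lambda B v.
    A = X.T @ L_i @ X
    B = X.T @ L_p @ X + 1e-6 * np.eye(X.shape[1])    # small ridge keeps B positive definite
    vals, vecs = eigh(A, B)                          # eigenvalues in ascending order
    return vecs[:, :dim]                             # directions with smallest eigenvalues

# Usage: project one utterance's frames and inspect the resulting curves,
# which serve as the per-utterance trajectories used for classification.
rng = np.random.default_rng(0)
frames = rng.normal(size=(30, 10))                   # 30 frames, 10-D visual features
V = graph_embedding(frames)
curves = frames @ V
```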