Spatio-temporal alignment of electronic slides with the corresponding presentation video opens up a number of possibilities for making instructional content more accessible and understandable, such as video quality improvement, better content analysis, and novel compression approaches for low-bandwidth access. However, these applications require accurate transformations between slides and video frames, which are challenging to estimate in capture settings that use pan-tilt-zoom (PTZ) cameras. In this paper, we present a nonlinear optimization approach for accurate registration of slide images to video frames. Instead of estimating the projective transformation (i.e., homography) between a single pair of slide and frame images, we jointly solve for a set of homographies over the frame sequence associated with a given slide. Quantitative evaluation confirms that this substantially improves alignment accuracy.