We present a new content-based approach to summarizing instructional videos. We first redefine "scene" in instructional videos. Focusing on one dominant scene type, that of handwritten lecture notes, we define semantic content as "ink pixels", and present a low-level retrieval technique that extracts this content from each frame while accounting for various occlusion and illumination effects. "Key frames" in this video genre are redefined as a set of frames that covers the semantic content, and the fluctuating amount of visible ink is used to drive a heuristic real-time key frame extraction method. A rule-based method is also provided to synchronize key frames with audio. We extend our method to the extraction of key frame hierarchies. Applied to a 17-minute (30K frames) instructional video sequence, the method yields seven key frames. These techniques create tunable instructional summaries over a wide and dynamically varying range of compression fact...
Tiecheng Liu, John R. Kender
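
To make the idea of ink-driven key frame selection concrete, the following is a minimal sketch and not the authors' method. It assumes dark ink on a light background, grayscale frames given as nested lists, and a hypothetical rule: a frame is kept as a key frame when it holds the peak ink count seen before the visible ink falls sharply (for example, when the page is erased or replaced). The names `count_ink_pixels`, `select_key_frames`, `threshold`, and `drop_ratio` are illustrative and do not come from the paper; occlusion and illumination handling are omitted.

```python
from typing import List, Optional, Sequence


def count_ink_pixels(frame: Sequence[Sequence[int]], threshold: int = 100) -> int:
    """Count pixels darker than `threshold`.

    Assumption: ink is dark on a light background, so a fixed
    grayscale threshold is a crude stand-in for ink detection.
    """
    return sum(1 for row in frame for value in row if value < threshold)


def select_key_frames(ink_counts: Sequence[int], drop_ratio: float = 0.5) -> List[int]:
    """Return indices of frames whose ink count peaks before a large drop.

    Single pass over the per-frame ink counts: track the running peak,
    and when the current count falls below (1 - drop_ratio) of that peak,
    emit the peak frame and start tracking a new peak.
    """
    key_frames: List[int] = []
    peak_index: Optional[int] = None
    for i, count in enumerate(ink_counts):
        if peak_index is None or count > ink_counts[peak_index]:
            peak_index = i  # new running maximum of visible ink
        elif count < (1.0 - drop_ratio) * ink_counts[peak_index]:
            key_frames.append(peak_index)  # ink dropped sharply: keep the peak
            peak_index = i
    # Keep the final peak if it contains any ink at all.
    if peak_index is not None and ink_counts[peak_index] > 0:
        key_frames.append(peak_index)
    return key_frames


if __name__ == "__main__":
    # Synthetic ink counts: ink accumulates, is erased, accumulates again.
    counts = [0, 10, 20, 30, 5, 15, 25, 2, 8]
    print(select_key_frames(counts))
```

Raising `drop_ratio` in this sketch makes the selection less sensitive to small erasures and so yields fewer key frames, which is one simple way a summary could be made tunable.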