In building a video retrieval system, the crucial point is how to provide effective access to video content. This paper focuses on Japanese cooking instruction utterances and describes a method for analyzing their structure, which leads to a summary of the video. We detect the hierarchical structure of video content using both linguistic and visual information. We found that integrating visual information improves the detection of task units compared with using linguistic information alone.