This paper introduces a novel method for capturing conversation scenes and editing the resulting videos. The system aims to acquire high-quality videos and to use them in multimedia content that provides conversational functions. First, we describe a video capturing system composed of an “environmental camera module” and “contents capturing camera modules”. Next, we present a computational video editing model based on optimization and constraint satisfaction. This model produces different editing results depending on parameters chosen to suit the purpose of the video. Building on these two components, we describe our approach to realizing multimedia content that answers and interacts with users in conversational environments.