In this paper, we propose a coherent framework for joint key-frame extraction and object-based video segmentation. Conventionally, key-frame extraction and object segmentation are implemented independently because they operate on different semantic levels, which ignores the inherent relationship between key-frames and objects. The proposed method extracts a small number of key-frames within a shot such that the divergence between video objects in a feature space is maximized, which supports robust and efficient object segmentation. The method combines the advantages of temporal and object-based video segmentation and helps build a unified framework for content-based analysis and structured video representation. Theoretical analysis and simulation results on both synthetic and real video sequences demonstrate the efficiency and robustness of the proposed method.