—Captions in videos play a significant role for automatically understanding and indexing video content, since much semantic information is associated with them. This paper presents an effective approach to extracting captions from videos, in which multiple different categories of features (edge, color, stroke etc.) are utilized, and the spatio-temporal characteristics of captions are considered. First, our method exploits the distribution of gradient directions to decompose a video into a sequence of clips temporally, so that each clip contains a caption at most, which makes the successive extraction computation more efficient and accurate. For each clip, the edge and corner information are then utilized to locate text regions. Further, text pixels are extracted based on the assumption that text pixels in text regions always have homogeneous color, and their quantity dominates the region relative to non-text pixels with different colors. Finally, the segmentation results are further ...