Caption detection in video has been an active research topic in recent years. For conventional methods, one of the most difficult problems is to extract, effectively and quickly, the durations of captions of different sizes against complex backgrounds. To solve this problem, a novel method is presented to locate and track captions in video. The main contributions are: (1) a multi-scale Harris-corner-based method that detects the initial position of a caption; and (2) the Steady Global Feature (SGF), which is used to determine the caption duration. Extensive experiments demonstrate the effectiveness of the proposed method.