In this article, we address the problem of caption detection in video sequences. Unlike most existing approaches, which rely on a single detector followed by ad hoc and costly post-processing, we combine several detectors and merge their results so as to exploit the strengths of each. We first study how captions appear in video frames and identify their main features: color constancy and contrast with the background, edge density and regularity, and temporal persistence. Based on these features, we then select or design appropriate detectors and compare several fusion strategies for merging their outputs. The systematic process we followed and the satisfactory results we obtained validate our contribution.
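To make the multi-detector fusion idea concrete, the following minimal sketch illustrates one plausible instantiation; it is not the paper's implementation. The three detector functions are hypothetical stand-ins for the feature cues named above (edge density, background contrast, temporal persistence), and the weighted-average fusion rule is only one of the strategies a comparison like ours might consider (voting or rule-based combination are alternatives).

    import numpy as np

    # Hypothetical per-feature detectors: each maps grayscale input to a
    # confidence map in [0, 1] marking likely caption pixels.

    def edge_density_map(frame: np.ndarray) -> np.ndarray:
        # Gradient magnitude as a crude proxy for edge density.
        gy, gx = np.gradient(frame.astype(float))
        mag = np.hypot(gx, gy)
        return mag / (mag.max() + 1e-9)

    def contrast_map(frame: np.ndarray) -> np.ndarray:
        # Deviation from the global mean as a crude background-contrast cue.
        dev = np.abs(frame.astype(float) - frame.mean())
        return dev / (dev.max() + 1e-9)

    def persistence_map(frames: list) -> np.ndarray:
        # Low temporal variance suggests pixels that persist across frames.
        stack = np.stack([f.astype(float) for f in frames])
        var = stack.var(axis=0)
        return 1.0 - var / (var.max() + 1e-9)

    def fuse(maps: list, weights: list, threshold: float = 0.5) -> np.ndarray:
        # Weighted-average fusion of confidence maps into a binary caption mask.
        w = np.asarray(weights, dtype=float)
        fused = sum(wi * m for wi, m in zip(w / w.sum(), maps))
        return fused >= threshold

    # Usage on a short clip of synthetic grayscale frames:
    frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(5)]
    mask = fuse(
        [edge_density_map(frames[-1]), contrast_map(frames[-1]), persistence_map(frames)],
        weights=[0.4, 0.3, 0.3],
    )
    print(mask.shape, mask.dtype)  # (120, 160) bool

The appeal of fusing normalized confidence maps, rather than hard per-detector decisions, is that a pixel weakly supported by several cues can still be retained, which is what lets the combination compensate for the weaknesses of any single detector.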