Structuring video data is necessary for effective retrieval and summarization. In particular, grouping scenes that are semantically similar contributes greatly to this structuring. In this paper, we propose a method for clustering scenes with relevance feedback, which may help bridge the gap between video data and its semantics. First, fixed-length spatiotemporal video segments are clustered according to the image features of each segment. Then, a user gives feedback on the clustering results, indicating whether each segment is relevant to the cluster it belongs to. Clustering accuracy can be improved through this interaction, using the feedback information. For diverse kinds of video streams, we investigated how the feedback should be given and demonstrated the effectiveness of the interactive clustering.
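The cluster-then-feedback loop described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the intensity-histogram feature, the plain k-means step, the deterministic initialization, and the rule that moves a segment marked irrelevant to its nearest other cluster are all assumptions made for illustration, and the function names `segment_features`, `kmeans`, and `apply_feedback` are hypothetical.

```python
import numpy as np


def segment_features(frames, seg_len, bins=8):
    """Assumed feature: normalized intensity histogram of each fixed-length segment.

    frames: array of shape (T, H, W) with 8-bit grayscale values.
    """
    n_segs = len(frames) // seg_len
    feats = []
    for s in range(n_segs):
        chunk = frames[s * seg_len:(s + 1) * seg_len]
        hist, _ = np.histogram(chunk, bins=bins, range=(0, 256))
        feats.append(hist / hist.sum())
    return np.array(feats)


def kmeans(features, k, iters=20):
    """Plain k-means with evenly spaced initial centroids (deterministic for the sketch)."""
    features = np.asarray(features, dtype=float)
    init_idx = np.linspace(0, len(features) - 1, k).astype(int)
    centroids = features[init_idx].copy()
    for _ in range(iters):
        # Distance from every segment to every centroid, shape (n, k).
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels, centroids


def apply_feedback(features, labels, relevant, k):
    """Hypothetical refinement step driven by per-segment relevance feedback.

    relevant[i] is True if the user judged segment i relevant to its cluster.
    Centroids are rebuilt from relevant segments only; each segment judged
    irrelevant is moved to the nearest cluster other than its current one.
    """
    features = np.asarray(features, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    # Clusters with no members keep an infinite centroid and are never chosen.
    centroids = np.full((k, features.shape[1]), np.inf)
    for c in range(k):
        mask = (labels == c) & relevant
        if not mask.any():
            mask = labels == c  # no relevant member: fall back to all members
        if mask.any():
            centroids[c] = features[mask].mean(axis=0)
    new_labels = labels.copy()
    for i in np.flatnonzero(~relevant):
        dists = np.linalg.norm(centroids - features[i], axis=1)
        dists[labels[i]] = np.inf  # exclude the cluster the user rejected
        new_labels[i] = dists.argmin()
    return new_labels


# Toy usage: two tight groups of 2-D features plus one ambiguous segment (index 3).
features = np.array([[0, 0], [0, 1], [1, 0], [6, 6],
                     [10, 10], [10, 11], [11, 10]], dtype=float)
labels, _ = kmeans(features, k=2)
relevant = np.array([True, True, True, False, True, True, True])
refined = apply_feedback(features, labels, relevant, k=2)
```

In this toy run the ambiguous segment is first placed in the second cluster; once it is marked irrelevant, the refinement step moves it to the first cluster. In an interactive setting the feedback and refinement steps would be repeated, and how much feedback to request per round is exactly the kind of question the paper investigates.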