In this paper, we propose a framework for modeling video sequences through spatiotemporal descriptions of video shots. Spatiotemporal volumes are extracted with an efficient segmentation algorithm. Each video shot is described by an adjacency graph whose nodes capture the visual properties of the volumes and whose edges encode the spatiotemporal relationships between them. The cost of extracting visual descriptors over an entire shot is reduced by efficiently propagating and merging region descriptors along the spatiotemporal volumes. For comparing video shots, we propose a similarity measure that tolerates variability in the spatiotemporal representation. Promising experimental results are reported on several categories of video shots.
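To make the shot representation concrete, the following is a minimal Python sketch of an adjacency graph over spatiotemporal volumes, together with the descriptor propagation-and-merging step. The names (`Volume`, `ShotGraph`, `merge_region_descriptor`) and the weighted running-mean merge rule are illustrative assumptions; the abstract does not fix the descriptor type or the merge rule.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Volume:
    """A spatiotemporal volume: a region tracked across the frames of a shot (hypothetical structure)."""
    vid: int
    descriptor: np.ndarray   # e.g. a color histogram aggregated over the volume
    pixel_count: int = 0

@dataclass
class ShotGraph:
    """Adjacency graph of a shot: nodes are volumes, edges are spatiotemporal adjacencies."""
    volumes: dict[int, Volume] = field(default_factory=dict)
    edges: set[tuple[int, int]] = field(default_factory=set)

    def add_volume(self, v: Volume) -> None:
        self.volumes[v.vid] = v

    def add_adjacency(self, a: int, b: int) -> None:
        # Store edges in canonical order so the graph stays undirected.
        self.edges.add((min(a, b), max(a, b)))

def merge_region_descriptor(volume: Volume, region_hist: np.ndarray, region_size: int) -> None:
    """Fold one frame's region descriptor into the volume's running descriptor.

    Propagating and merging per-region descriptors this way avoids re-extracting
    them over the whole shot. The size-weighted running mean used here is an
    assumption, not the paper's stated merge rule.
    """
    total = volume.pixel_count + region_size
    volume.descriptor = (volume.descriptor * volume.pixel_count
                         + region_hist * region_size) / total
    volume.pixel_count = total
```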
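The abstract does not specify the similarity measure, so the sketch below is only an illustrative stand-in: a greedy one-to-one matching of volume descriptors in which unmatched or poorly matched volumes lower the score rather than voiding the comparison, which is one simple way to tolerate variability in the representation. The threshold `tau` and the scoring formula are assumptions.

```python
import numpy as np

def shot_similarity(g1: "ShotGraph", g2: "ShotGraph", tau: float = 0.5) -> float:
    """Illustrative tolerant similarity between two shot graphs (not the paper's measure).

    Volume pairs are matched greedily by descriptor distance; only pairs closer
    than `tau` contribute, so split or missing volumes reduce the score smoothly.
    """
    # All cross-graph pairs, sorted by ascending descriptor distance.
    pairs = sorted(
        ((float(np.linalg.norm(v1.descriptor - v2.descriptor)), i, j)
         for i, v1 in g1.volumes.items()
         for j, v2 in g2.volumes.items()),
        key=lambda t: t[0],
    )
    used1, used2, score = set(), set(), 0.0
    for dist, i, j in pairs:
        if dist > tau:
            break                     # remaining pairs are even farther apart
        if i in used1 or j in used2:
            continue                  # enforce one-to-one matching
        used1.add(i)
        used2.add(j)
        score += 1.0 - dist / tau     # closer descriptors contribute more
    # Normalize by the larger graph so extra volumes in either shot are penalized.
    return score / max(len(g1.volumes), len(g2.volumes))
```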