This paper addresses the object-based representation of video shots acquired by a moving camera. Our approach extracts foreground regions that can represent semantic objects of interest. However, foreground regions extracted by motion compensation are not always representative of the entity they depict. Filtering and clustering these regions allows us to retain only the most representative region for each real object in the shot, i.e., the key-object.
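
To make the filter, cluster, and select pipeline concrete, the following is a minimal sketch under stated assumptions, not the method of this paper. The `Region` structure, the area-based filter, the greedy distance-based clustering, the medoid selection rule, and the parameters `min_area` and `max_dist` are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Region:
    """A foreground region from one frame (hypothetical structure)."""
    frame: int       # index of the frame the region was extracted from
    area: float      # region size in pixels
    feature: tuple   # fixed-length appearance descriptor (assumed to exist)


def filter_regions(regions, min_area):
    """Drop regions below a minimum area (an assumed filtering criterion)."""
    return [r for r in regions if r.area >= min_area]


def cluster_regions(regions, max_dist):
    """Greedy clustering: a region joins the first cluster whose seed
    (first member) lies within max_dist; otherwise it starts a new cluster."""
    clusters = []
    for r in regions:
        for c in clusters:
            if dist(r.feature, c[0].feature) <= max_dist:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters


def key_object(cluster):
    """Return the medoid: the member with the smallest summed distance
    to all members of its cluster (an assumed notion of 'most representative')."""
    return min(cluster, key=lambda r: sum(dist(r.feature, o.feature) for o in cluster))


def key_objects(regions, min_area=100.0, max_dist=1.0):
    """Filter, cluster, then keep one representative region per cluster."""
    kept = filter_regions(regions, min_area)
    return [key_object(c) for c in cluster_regions(kept, max_dist)]


# Toy data: three views of one object, one tiny spurious region, one other object.
regions = [
    Region(0, 500.0, (0.0, 0.0)),
    Region(1, 520.0, (0.2, 0.0)),
    Region(2, 480.0, (0.4, 0.0)),
    Region(3, 20.0, (0.1, 0.1)),
    Region(4, 900.0, (5.0, 5.0)),
]
selected = key_objects(regions)
print([r.frame for r in selected])  # → [1, 4]
```

In this toy run the small region from frame 3 is filtered out, the regions from frames 0 to 2 fall into one cluster whose medoid is the region from frame 1, and the region from frame 4 forms its own cluster. Any real system would replace the descriptor, the distance, and both thresholds with choices suited to its data.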