We propose a new graph-based data structure, called the Spatio-Temporal Region Graph (STRG), which represents the content of a video sequence. Unlike existing graph-based structures, which mainly capture spatial information at the frame level, the proposed STRG also captures temporal information at the video level. After an STRG is constructed from a given video sequence, it is decomposed into subgraphs called Object Graphs (OGs), which represent the temporal characteristics of video objects. For unsupervised learning, we cluster similar OGs into groups, which requires matching pairs of OGs. For this graph matching, we introduce a new distance measure, called Extended Graph Edit Distance (EGED), which can handle the temporal characteristics of OGs. For the clustering itself, we use Expectation Maximization (EM) with EGED as the distance measure. Experiments on real video streams show the effectiveness and robustness of the proposed schemes.
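
As a rough illustration of the structure described above, the following minimal sketch stores regions as nodes, links regions within a frame by spatial edges, and links regions across consecutive frames by temporal edges. Object graphs are then read off by following temporal edges. The class name `STRG`, its methods, the `centroid` attribute, and the assumption that each region has at most one temporal successor are all hypothetical choices made for this example; they are not taken from the proposed method, and EGED and the EM clustering step are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class STRG:
    """Toy spatio-temporal region graph (illustrative assumption, not the paper's definition)."""
    # Node key is (frame_index, region_id); value is a dict of region attributes.
    nodes: dict = field(default_factory=dict)
    # Undirected edges between regions of the same frame.
    spatial_edges: set = field(default_factory=set)
    # Directed edges: region -> its single successor in the next frame (assumed).
    temporal_edges: dict = field(default_factory=dict)

    def add_region(self, frame, region_id, **attrs):
        self.nodes[(frame, region_id)] = attrs

    def add_spatial_edge(self, a, b):
        if a[0] != b[0]:
            raise ValueError("spatial edges connect regions in the same frame")
        self.spatial_edges.add(frozenset((a, b)))

    def add_temporal_edge(self, a, b):
        if b[0] != a[0] + 1:
            raise ValueError("temporal edges connect consecutive frames")
        self.temporal_edges[a] = b

    def object_graphs(self):
        """Return one chain of nodes per region that has no temporal predecessor."""
        has_predecessor = set(self.temporal_edges.values())
        starts = sorted(n for n in self.nodes if n not in has_predecessor)
        chains = []
        for start in starts:
            chain = [start]
            # Follow temporal edges until the region is no longer tracked.
            while chain[-1] in self.temporal_edges:
                chain.append(self.temporal_edges[chain[-1]])
            chains.append(chain)
        return chains


# Usage: two regions tracked over three frames (made-up values).
g = STRG()
for f in range(3):
    g.add_region(f, 0, centroid=(10 + f, 5))
    g.add_region(f, 1, centroid=(40, 20))
    g.add_spatial_edge((f, 0), (f, 1))
for f in range(2):
    g.add_temporal_edge((f, 0), (f + 1, 0))
    g.add_temporal_edge((f, 1), (f + 1, 1))

ogs = g.object_graphs()
```

In this toy setting each object graph is a simple chain of region nodes over time, so a pairwise distance between two chains would be the natural input to a clustering step.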