We propose a unified approach to video summarization based on the analysis of video structure and video highlights. Our approach emphasizes both the content balance and the perceptual quality of a summary. The normalized cut algorithm is employed to globally and optimally partition a video into clusters. A motion attention model based on human perception is employed to compute the perceptual quality of shots and clusters. The clusters, together with the computed attention values, form a temporal graph, similar to a Markov chain, that inherently describes the evolution and perceptual importance of video clusters. In our application, the flow of the temporal graph is utilized to group similar clusters into scenes, while the attention values are used as guidelines to select appropriate sub-shots in scenes for summarization.
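
To make the clustering step concrete, the following is a minimal sketch of a two-way normalized cut on a shot similarity graph. It uses the standard spectral relaxation (the second-smallest eigenvector of the normalized Laplacian) and is not necessarily the exact formulation used in this work. The one-dimensional shot features, the Gaussian similarity, and the names `shot_similarity`, `normalized_cut_bipartition`, `ncut_value`, and `sigma` are assumptions introduced for illustration only.

```python
import numpy as np


def shot_similarity(features, sigma=1.0):
    """Gaussian similarity between hypothetical 1-D shot features (an assumption)."""
    features = np.asarray(features, dtype=float)
    diff = features[:, None] - features[None, :]
    return np.exp(-(diff ** 2) / (sigma ** 2))


def normalized_cut_bipartition(W):
    """Split graph nodes into two groups via the relaxed two-way normalized cut.

    W is a symmetric, non-negative similarity matrix with positive row sums.
    Returns a boolean mask marking the nodes assigned to one of the two groups.
    """
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                      # node degrees
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(L_sym)
    # Map the second-smallest eigenvector back to the generalized problem
    y = d_inv_sqrt * eigvecs[:, 1]
    # Threshold at zero; which side is labelled True is arbitrary
    return y > 0


def ncut_value(W, mask):
    """Normalized cut cost: cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)."""
    W = np.asarray(W, dtype=float)
    cut = W[mask][:, ~mask].sum()
    return cut / W[mask].sum() + cut / W[~mask].sum()


if __name__ == "__main__":
    # Six hypothetical shots forming two visually distinct groups
    features = [0.0, 0.2, 0.4, 2.0, 2.2, 2.4]
    W = shot_similarity(features, sigma=1.0)
    mask = normalized_cut_bipartition(W)
    print("partition:", mask.astype(int))
    print("ncut cost:", ncut_value(W, mask))
```

Applying the bipartition recursively to each resulting group, and stopping when the cut cost exceeds a chosen threshold, is one common way to obtain more than two clusters; the stopping rule would be a further design choice not specified here.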