We present a novel approach to the causal temporal analysis of event data from video content. Our key observation is that the sequence of visual words produced by a space-time dictionary representation of a video sequence can be interpreted as a multivariate point process. By using a spectral version of the pairwise test for Granger causality, we can identify patterns of interaction between words and group them into independent causal sets. We demonstrate qualitatively that this produces semantically meaningful groupings, and quantitatively that these groupings improve performance in retrieving and classifying social games from unstructured videos.
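As a rough illustration of the pipeline summarized above, the following sketch bins each visual word's occurrences into per-frame counts, scores every ordered pair of words with a frequency-domain Granger measure, and merges words whose score exceeds a threshold into the same causal set. It is a minimal sketch under stated assumptions, not the estimator used in this work: it swaps in a parametric bivariate autoregressive fit with Geweke's spectral decomposition, and the function names (`fit_var`, `spectral_gc`, `causal_groups`), the model order, the threshold of 0.1, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def fit_var(x, order):
    """Least-squares fit of a bivariate VAR(order) model to x with shape (T, 2)."""
    x = x - x.mean(axis=0)
    T = x.shape[0]
    Y = x[order:]
    # Each row of X holds the lagged samples [x[t-1], x[t-2], ..., x[t-order]].
    X = np.hstack([x[order - k:T - k] for k in range(1, order + 1)])
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ B
    sigma = resid.T @ resid / (T - order)            # residual covariance
    A = B.T.reshape(2, order, 2).transpose(1, 0, 2)  # A[k] is the lag-(k+1) matrix
    return A, sigma

def spectral_gc(source, target, order=3, n_freqs=64):
    """Mean over frequency of Geweke's spectral Granger measure, source -> target."""
    x = np.column_stack([target, source]).astype(float)  # index 0 = target
    A, sigma = fit_var(x, order)
    lags = np.arange(1, order + 1)
    partial = sigma[1, 1] - sigma[0, 1] ** 2 / sigma[0, 0]
    gc = np.empty(n_freqs)
    for n, f in enumerate(np.linspace(0.0, 0.5, n_freqs)):
        phase = np.exp(-2j * np.pi * f * lags)
        H = np.linalg.inv(np.eye(2) - np.tensordot(phase, A, axes=(0, 0)))
        S = H @ sigma @ H.conj().T                   # spectral matrix at f
        s_tt = S[0, 0].real
        gc[n] = np.log(s_tt / (s_tt - partial * abs(H[0, 1]) ** 2))
    return gc.mean()

def causal_groups(counts, threshold=0.1, order=3):
    """Merge words linked by an above-threshold score (threshold is an assumption)."""
    n = counts.shape[1]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if i != j and spectral_gc(counts[:, i], counts[:, j], order) > threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Synthetic data (assumed for illustration): word 0 tends to trigger word 1
# one frame later, while word 2 fires independently of both.
rng = np.random.default_rng(0)
T = 2000
a = (rng.random(T) < 0.2).astype(int)
b = np.zeros(T, dtype=int)
b[1:] = a[:-1] * (rng.random(T - 1) < 0.8)
b = np.maximum(b, (rng.random(T) < 0.05).astype(int))
c = (rng.random(T) < 0.2).astype(int)
counts = np.column_stack([a, b, c])

gc_ab = spectral_gc(a, b)  # influence of word 0 on word 1
gc_ba = spectral_gc(b, a)  # influence of word 1 on word 0
groups = causal_groups(counts)
print(gc_ab, gc_ba, groups)
```

Averaging the measure over frequency gives a single score per ordered pair, which keeps the grouping step simple. A real system would more plausibly replace the fixed threshold with a statistical significance test and estimate the spectra directly from the event times rather than from binned counts.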