Computer vision and artificial intelligence research has long danced around the subject of causality: vision researchers use causal relationships to aid action detection, and AI researchers propose methods for causal induction that are independent of video sensors. In this paper, we argue that learning perceptual causality from video is a necessary step toward understanding scenes in video. We explain how current object and action detection suffers without causality, and how current causality research suffers without grounding in raw sensor data. We then describe one plausible solution for grounding perceptual causality in raw sensor data. Applying causal knowledge to vision research provides a much deeper level of understanding than considering actions and objects independently. Causal understanding enables joint spatial-temporal-causal inference, allowing causal information to connect the spatial and temporal domains. With joint inference, it becomes possible to infer misdetections.
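To make the benefit of joint inference concrete, the following is a minimal sketch (not the model described in this paper) of how a learned causal relation can recover a misdetection: when an effect is observed in the video (a door changes from closed to open) but the detector gives only a weak score for the causing action, the causal link raises the posterior on that action. The function name and all probabilities below are illustrative assumptions.

```python
# Toy sketch: using a learned causal link to recover a missed action detection.
# Assumes one binary cause (an "open door" action) and one effect (the door's
# fluent changes from closed to open); all probabilities are made-up values.

def posterior_cause(prior_cause, p_effect_given_cause, p_effect_given_no_cause,
                    effect_observed):
    """Bayes update for P(cause | effect observation)."""
    if effect_observed:
        num = p_effect_given_cause * prior_cause
        den = num + p_effect_given_no_cause * (1.0 - prior_cause)
    else:
        num = (1.0 - p_effect_given_cause) * prior_cause
        den = num + (1.0 - p_effect_given_no_cause) * (1.0 - prior_cause)
    return num / den

# The action detector alone gives only a weak score for "open door".
detector_prior = 0.15

# A learned perceptual-causal relation: the "door becomes open" fluent change
# is usually preceded by an opening action and rarely occurs spontaneously.
p_door_opens_given_action = 0.90
p_door_opens_spontaneously = 0.05

# The effect (door changed from closed to open) is observed in the video,
# so the causal link pulls the missed action above the detector's own score.
p_action = posterior_cause(detector_prior, p_door_opens_given_action,
                           p_door_opens_spontaneously, effect_observed=True)
print(f"P(open-door action | door opened) = {p_action:.2f}")  # ~0.76
```

In a full joint spatial-temporal-causal inference, such links would be evaluated over many detections and fluent changes at once rather than in a single Bayes update; the sketch only illustrates the direction of information flow from effects back to missed causes.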