This paper presents intermodal collaboration: a strategy for semantic content analysis of broadcast sports video. Broadcast video can be viewed as a set of multimodal streams, such as visual, auditory, text (closed-caption), and graphics streams. These streams are analyzed collaboratively, exploiting the temporal dependency among them, to improve the reliability and efficiency of semantic content analysis tasks such as extracting highlight scenes from sports video and automatically generating annotations for specific scenes. Two case studies experimentally confirm the effectiveness of intermodal collaboration.
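To make the intermodal idea concrete, the sketch below illustrates one plausible form of such collaboration, not the paper's actual method: a closed-caption keyword narrows the temporal search range, and a candidate highlight is kept only if the audio stream (e.g., crowd-cheer energy) also peaks within that range. All names, keywords, and thresholds here are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class CaptionCue:
    time: float  # seconds from the start of the broadcast
    text: str

# Hypothetical keywords that often accompany highlight scenes in captions.
HIGHLIGHT_KEYWORDS = {"goal", "home run", "touchdown"}

def highlight_windows(cues, audio_energy, frame_rate=10.0,
                      pre=5.0, post=15.0, energy_threshold=0.8):
    """Return (start, end) windows likely to contain highlight scenes.

    Intermodal collaboration: the text stream proposes a temporal window
    around each keyword cue; the window is accepted only if the auditory
    stream (normalized short-time energy sampled at `frame_rate` frames
    per second) also peaks inside it.
    """
    windows = []
    for cue in cues:
        if not any(k in cue.text.lower() for k in HIGHLIGHT_KEYWORDS):
            continue
        start, end = cue.time - pre, cue.time + post
        lo = max(0, int(start * frame_rate))
        hi = min(len(audio_energy), int(end * frame_rate))
        # Keep the window only if both modalities agree.
        if hi > lo and max(audio_energy[lo:hi]) >= energy_threshold:
            windows.append((max(0.0, start), end))
    return windows

# Usage: a caption cue plus a nearby audio-energy peak yields one window.
cues = [CaptionCue(123.4, "And that's a GOAL for the home team!")]
energy = [0.1] * 1400   # 140 s of audio energy at 10 frames/s
energy[1250] = 0.95     # cheering peak shortly after the caption cue
print(highlight_windows(cues, energy))  # [(118.4, 138.4)]
```

The design point is that neither modality alone suffices: captions lag or lead the action, and audio peaks occur for non-highlight reasons, so requiring temporal agreement between streams filters false positives from either one.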