We propose combining a visual object detection algorithm, which finds local semantic objects in video frames (e.g., the baseball catcher), with an audio classification algorithm, which finds semantic audio objects in the audio track, to extract sports highlights. The resulting highlight candidates are then grouped into finer-resolution highlight segments using color or motion information. During this grouping phase, many of the false alarms can be correctly identified and eliminated. Our experimental results on baseball, soccer, and golf video are promising.
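As a rough illustration of the two-stage idea described above, the following minimal sketch pairs visual detections with nearby audio events to form highlight candidates, then splits the candidates into groups by a scalar motion score. It is not the authors' implementation: the names `Event`, `find_candidates`, `group_by_motion`, `max_gap`, and `threshold` are hypothetical, and both the pairing rule (an audio event starting shortly after a visual detection ends) and the single-threshold grouping are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """A detected visual or audio object with start/end times in seconds."""
    start: float
    end: float
    label: str


Segment = Tuple[float, float]


def find_candidates(visual: List[Event], audio: List[Event],
                    max_gap: float = 10.0) -> List[Segment]:
    """Pair each visual detection with audio events that begin within
    max_gap seconds after the detection ends (assumed pairing rule)."""
    candidates = []
    for v in visual:
        for a in audio:
            gap = a.start - v.end
            if 0.0 <= gap <= max_gap:
                candidates.append((v.start, a.end))
    return candidates


def group_by_motion(candidates: List[Segment], motion: List[float],
                    threshold: float = 0.5) -> Dict[str, List[Segment]]:
    """Split candidates into two groups by a per-candidate motion score
    (assumed rule; the score itself would come from the video)."""
    groups: Dict[str, List[Segment]] = {"high_motion": [], "low_motion": []}
    for cand, score in zip(candidates, motion):
        key = "high_motion" if score >= threshold else "low_motion"
        groups[key].append(cand)
    return groups


# Illustrative inputs only; these are not data from the experiments.
visual = [Event(10.0, 12.0, "catcher"), Event(50.0, 52.0, "catcher")]
audio = [Event(15.0, 20.0, "cheering"), Event(100.0, 105.0, "cheering")]
candidates = find_candidates(visual, audio)
groups = group_by_motion(candidates, motion=[0.8])
```

In a sketch like this, the grouping stage is where false alarms would be filtered: a group whose color or motion signature does not match a true highlight could be discarded as a whole.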