We explore the problem of rapid automatic semantic tagging of frames in unstructured (unedited) videos. We apply the Sort-Merge algorithm for feature selection to a large (>1000) heterogeneous feature set for lecture videos, to quickly locate the low-level image features most predictive of concepts such as "key frame with text" or "key frame with computer source code". For evaluation, we introduce a "keeper" heuristic for feature retention, which provides a baseline for comparison. We then compare early fusion and late fusion of diverse feature types; based on experiments on 12,395 frames, we find that late fusion generally offers higher Average Precision at lower computational cost than early fusion. However, merging redundant feature types does not necessarily improve performance over single feature types; both merged and unmerged performance must be explored.
Mitchell J. Morris, John R. Kender
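
To make the early-fusion versus late-fusion comparison in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a nearest-centroid scorer as a stand-in classifier, two hypothetical feature types (`color` and `edge`), and invented toy values. Early fusion concatenates the feature types into one vector per frame and trains a single scorer; late fusion trains one scorer per feature type and averages their scores. Both rankings are evaluated with Average Precision.

```python
from math import dist


def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_scorer(X, y):
    """Stand-in classifier (an assumption): nearest-centroid scoring.

    Returns a function whose value is higher when a vector lies closer
    to the positive centroid than to the negative one.
    """
    pos = centroid([x for x, label in zip(X, y) if label == 1])
    neg = centroid([x for x, label in zip(X, y) if label == 0])
    return lambda x: dist(x, neg) - dist(x, pos)


def concatenate(feature_types):
    """Join the per-frame vectors of every feature type into one vector."""
    return [sum(parts, []) for parts in zip(*feature_types)]


def early_fusion_scores(train_types, y, test_types):
    """One scorer trained on the concatenated feature vectors."""
    scorer = fit_scorer(concatenate(train_types), y)
    return [scorer(x) for x in concatenate(test_types)]


def late_fusion_scores(train_types, y, test_types):
    """One scorer per feature type; per-frame scores are averaged."""
    per_type = []
    for X_train, X_test in zip(train_types, test_types):
        scorer = fit_scorer(X_train, y)
        per_type.append([scorer(x) for x in X_test])
    return [sum(s) / len(s) for s in zip(*per_type)]


def average_precision(scores, labels):
    """Mean of precision-at-rank over the ranks of the positive items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical toy data: label 1 = "key frame with text", 0 = other frame.
train_color = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_edge = [[0.7], [0.9], [0.2], [0.1]]
train_labels = [1, 1, 0, 0]

test_color = [[0.85, 0.15], [0.15, 0.85], [0.7, 0.3]]
test_edge = [[0.8], [0.15], [0.6]]
test_labels = [1, 0, 1]

early = early_fusion_scores([train_color, train_edge], train_labels,
                            [test_color, test_edge])
late = late_fusion_scores([train_color, train_edge], train_labels,
                          [test_color, test_edge])

print("early fusion AP:", average_precision(early, test_labels))
print("late fusion AP:", average_precision(late, test_labels))
```

In this sketch each late-fusion scorer works on a short per-type vector, whereas early fusion works on the full concatenated vector. That is one plausible source of the cost difference the abstract reports, although the toy data here is far too small to reproduce the paper's findings.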