This paper describes the MSRA-USTC-SJTU experiments for TRECVID 2007. We participated in the high-level feature extraction and automatic search tasks. For high-level feature extraction, we investigated the benefit of unlabeled data via semi-supervised learning, the multi-layer multi-instance (MLMI) relations embedded in video via an MLMI kernel, and the correlations between concepts via correlative multi-label learning. For automatic search, we fused text, visual-example, and concept-based models, and used temporal consistency and face information for re-ranking and result refinement.
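The search pipeline above combines per-shot scores from several models and then re-ranks using temporal consistency. A minimal sketch of that idea follows, assuming simple linear weighted fusion and neighbor-based temporal smoothing; the function names, weights, and smoothing scheme are illustrative assumptions, not the paper's exact formulation.

```python
def fuse_scores(text, visual, concept, weights=(0.5, 0.3, 0.2)):
    """Linearly combine per-shot scores from three retrieval models
    (text, visual example, concept-based). Weights are hypothetical."""
    wt, wv, wc = weights
    return [wt * t + wv * v + wc * c for t, v, c in zip(text, visual, concept)]

def temporal_smooth(scores, alpha=0.25):
    """Re-rank by blending each shot's score with its temporal neighbors,
    reflecting the tendency of relevant shots to occur in consecutive runs."""
    n = len(scores)
    out = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else s
        right = scores[i + 1] if i < n - 1 else s
        out.append((1 - alpha) * s + alpha * 0.5 * (left + right))
    return out

# Example: fuse three score lists for a sequence of shots, then smooth.
fused = fuse_scores([1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0])
reranked = temporal_smooth(fused)
```

The design choice here is "late fusion": each model produces its own ranked scores, which are merged only at the score level, so individual models can be trained and tuned independently.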