To leverage large-scale weakly-tagged images for computer vision tasks such as object detection and scene recognition, we develop a novel cross-modal tag cleansing and junk image filtering algorithm. The algorithm cleanses the weakly-tagged images and their social tags (i.e., it removes irrelevant images and finds the most relevant tags for each image) by integrating the visual similarity contexts between the images with the semantic similarity contexts between their tags. It can address the issues of spam, polysemes, and synonyms more effectively and determine the relevance between the images and their social tags more precisely. As a result, large amounts of training images with more reliable labels can be harvested from large-scale weakly-tagged images and then used to train more effective classifiers for many computer vision tasks.
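
To make the cross-modal idea concrete, the following is a minimal illustrative sketch, not the algorithm developed in this paper. It assumes that visual similarity is the cosine similarity between image feature vectors and that semantic similarity between tags is the Jaccard overlap of the image sets they co-occur on. The function names (`tag_relevance`, `cleanse`) and the threshold value are hypothetical choices made only for illustration. A tag is scored by how strongly the tags of visually similar images support it, and an image whose tags all fall below the threshold is treated as junk.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed visual similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def tag_cooccurrence_similarity(tag_sets):
    """Return sim(a, b): Jaccard overlap of the images carrying tags a and b
    (an assumed stand-in for the semantic similarity context)."""
    images_with = {}
    for idx, tags in enumerate(tag_sets):
        for t in tags:
            images_with.setdefault(t, set()).add(idx)

    def sim(a, b):
        if a == b:
            return 1.0
        ia, ib = images_with.get(a, set()), images_with.get(b, set())
        union = ia | ib
        return len(ia & ib) / len(union) if union else 0.0

    return sim


def tag_relevance(features, tag_sets):
    """Score each (image, tag) pair by the visually weighted semantic support
    that the tag receives from the tags of the other images."""
    sim = tag_cooccurrence_similarity(tag_sets)
    n = len(features)
    scores = []
    for i in range(n):
        # Visual weights to every other image; negative similarities are clipped to 0.
        weights = [
            max(cosine(features[i], features[j]), 0.0) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        image_scores = {}
        for t in tag_sets[i]:
            support = sum(
                w * max((sim(t, s) for s in tag_sets[j]), default=0.0)
                for j, w in enumerate(weights)
            )
            image_scores[t] = support / total if total else 0.0
        scores.append(image_scores)
    return scores


def cleanse(features, tag_sets, threshold=0.5):
    """Keep tags scoring at least `threshold` (a hypothetical cutoff);
    an image left with an empty tag set is considered junk."""
    scores = tag_relevance(features, tag_sets)
    return [{t for t, s in img.items() if s >= threshold} for img in scores]


# Toy data: two visually similar images and one visually unrelated image.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
tag_sets = [{"beach", "sale"}, {"beach"}, {"sale"}]
cleaned = cleanse(features, tag_sets)
```

On this toy data, the first image keeps only the tag supported by its visual neighbor, and the third image loses its only tag and would be filtered out as junk. A realistic system would replace both similarity measures and the fixed threshold with whatever the full method specifies.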