Images without annotations are ubiquitous on the Internet, and recommending tags for them has become a challenging open task in image understanding. A common bottleneck in related work is the semantic gap between the image and text representations. In this paper, we bridge this gap by introducing a semantic layer: the space of word embeddings, in which image tags are represented as word vectors. Our model first learns the optimal mapping from the visual space to the semantic space using the training sources. We then annotate test images by decoding the semantic representations of their visual features. Extensive experiments demonstrate that our model outperforms state-of-the-art approaches in predicting image tags.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—indexing methods

General Terms
Algorithms, Performance, Experimentation

Keywords
Cross-modal study, image tagging, semantic representation, optimization model, word...
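To make the pipeline concrete, the following is a minimal illustrative sketch, not the paper's actual objective or implementation: it assumes a linear visual-to-semantic mapping learned by ridge regression, and decodes tags as the vocabulary words whose embeddings are nearest (by cosine similarity) to a test image's mapped feature. All function names, the regularization parameter lam, and the choice of a linear map are assumptions for illustration only.

```python
import numpy as np

def learn_mapping(X, S, lam=1.0):
    """Sketch of the training step (assumed linear, ridge-regularized):
    solve W = argmin_W ||X W - S||_F^2 + lam ||W||_F^2 in closed form,
    where X (n x d_v) holds visual features and S (n x d_s) holds the
    word-embedding targets of the training tags."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)

def predict_tags(x, W, vocab_embeddings, vocab_words, k=5):
    """Sketch of the decoding step: map a visual feature x into the
    semantic space and return the k vocabulary words whose embeddings
    are most similar (cosine) to the mapped point."""
    s = x @ W
    sims = (vocab_embeddings @ s) / (
        np.linalg.norm(vocab_embeddings, axis=1) * np.linalg.norm(s) + 1e-12
    )
    top = np.argsort(-sims)[:k]
    return [vocab_words[i] for i in top]
```

Under these assumptions, the learned map projects any test image into the word-embedding space, so tagging reduces to nearest-neighbor search among the tag vocabulary's word vectors.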