Abstract. With the enormous volume of digital images available today, effective and efficient methods for accessing desired images are essential. This paper presents an analogy between content-based image retrieval and text retrieval: pixels correspond to letters, patches to words, sets of patches to phrases, and groups of sets of patches to sentences. Achieving more accurate document matching requires more informative features, such as phrases and sentences, beyond individual words. The proposed approach first constructs visual words through local patch extraction and description. We then mine association rules between frequent visual words within local image regions to construct visual phrases, which are in turn grouped into visual sentences.
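The pipeline outlined above can be illustrated with a rough sketch: patch descriptors are quantized into visual words (here with a tiny k-means, standing in for whatever quantizer the paper uses), and association rules between words that co-occur in the same local region yield visual phrases. The region representation, thresholds (`min_support`, `min_conf`), and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from collections import Counter

def kmeans(descriptors, k, iters=10, seed=0):
    """Tiny k-means: quantize patch descriptors into k visual words.
    Returns the cluster centers (the vocabulary) and each patch's word id."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned descriptors.
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

def mine_phrases(regions, min_support=0.3, min_conf=0.6):
    """Mine rules w1 -> w2 between visual words co-occurring in the
    same local region; a pair passing both thresholds is a phrase."""
    n = len(regions)
    single, pair = Counter(), Counter()
    for words in regions:
        ws = set(words)
        single.update(ws)
        pair.update((a, b) for a in ws for b in ws if a < b)
    phrases = []
    for (a, b), c in pair.items():
        support = c / n            # fraction of regions with both words
        conf = c / single[a]       # P(b in region | a in region)
        if support >= min_support and conf >= min_conf:
            phrases.append((a, b, support, conf))
    return phrases
```

For example, over regions `[[0, 1, 2], [0, 1], [0, 1, 3], [2, 3]]` the word pair `(0, 1)` co-occurs in three of four regions and would be mined as a phrase, while the rarer pairs fall below the support threshold.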