We present a new, sophisticated algorithm to select suitable training images for our biologically motivated attention system VOCUS. The system detects regions of interest depending on bottom-up (scene-dependent) and top-down (target-specific) cues. The top-down cues are learned by VOCUS from one or several training images. We show that our algorithm chooses a subset of the training set that outperforms both selecting a single image and simply using all available images for learning. With this algorithm, VOCUS is able to quickly and robustly detect targets in numerous real-world scenes.