In this paper, we propose a novel approach for learning a generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by a vector of pointwise mutual information (PMI). In this midlevel feature space, we believe that features produced by similar sources lie on a certain manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using diffusion distance. The underlying idea is to embed the midlevel features into a semantic lower-dimensional space. Our goal is to construct a compact yet discriminative semantic visual vocabulary.
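As a rough illustration of the two ingredients named above, the following sketch computes PMI vectors from a co-occurrence count matrix and then embeds them with a basic diffusion map. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the count-matrix layout (rows are midlevel features, columns are contexts such as images), the Gaussian kernel, and the names `pmi_matrix`, `diffusion_map`, `sigma`, `t`, `n_components`, and `eps` are all hypothetical choices made for the example.

```python
import numpy as np

def pmi_matrix(counts, eps=1e-12):
    """PMI between row items and column items of a count matrix.

    Assumed layout (hypothetical): counts[i, j] is how often midlevel
    feature i occurs in context j. eps is an assumed smoothing constant
    that keeps the logarithm finite for zero counts.
    """
    counts = np.asarray(counts, dtype=float)
    p_xy = counts / counts.sum()              # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)     # row marginals
    p_y = p_xy.sum(axis=0, keepdims=True)     # column marginals
    return np.log((p_xy + eps) / (p_x * p_y + eps))

def diffusion_map(X, sigma, n_components=2, t=1):
    """Basic diffusion-map embedding of the rows of X.

    Assumes a Gaussian kernel with bandwidth sigma and diffusion time t.
    Euclidean distance between embedded rows approximates the diffusion
    distance (up to a global scale and the truncation to n_components).
    """
    X = np.asarray(X, dtype=float)
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))      # pairwise affinities
    d = W.sum(axis=1)                         # node degrees
    # Symmetric matrix sharing eigenvalues with the Markov matrix P = D^-1 W.
    S = W / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]            # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(d)[:, None]          # right eigenvectors of P
    # Drop the trivial constant eigenvector (eigenvalue 1).
    return psi[:, 1:n_components + 1] * vals[1:n_components + 1] ** t
```

In this sketch, calling `diffusion_map(pmi_matrix(counts), sigma=...)` would give one low-dimensional point per midlevel feature; grouping those points (for example with k-means) is one way such an embedding could be turned into a vocabulary, though the bandwidth and number of components would need tuning on real data.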
Although the conventional approach using k-means is good for vocabulary construction, its performance is sensitive to the size of the visual vocabulary. In addition, the learnt visual words are not semantically meaningful since the clustering criterion is based on appearance similarity onl...