Bag-of-words image representations generally follow a
bottom-up paradigm. The successive stages of the process
(feature detection, feature description, vocabulary
construction, and image representation) are performed
independently of the object classes to be detected. In
such a framework, combining multiple cues such as shape
and color often yields results below expectations.
This paper presents a novel method for recognizing object
categories when using multiple cues by separating the
shape and color cues. Color is used to guide attention by
means of a top-down category-specific attention map. The
color attention map is then used to modulate the
shape features by taking more features from image regions
that are likely to contain an object instance. This
procedure yields a separate image histogram representation
for each category. Furthermore, we argue that
the method combines the advantages of both early and late
fusion....
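
A minimal sketch of how such a color-guided modulation could be realized is given below. It assumes that every local feature carries a shape visual-word index and a color visual-word index, and that an attention table gives a weight for each (category, color word) pair; the abstract does not specify how these weights are obtained. The function name, the parameter names, and the soft weighting scheme are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def color_attention_histograms(
    shape_words: Sequence[int],
    color_words: Sequence[int],
    attention: Sequence[Sequence[float]],
    n_shape_words: int,
) -> List[List[float]]:
    """Build one shape-word histogram per category.

    shape_words[i] and color_words[i] are the (hypothetical) shape and color
    visual-word indices of local feature i. attention[k][c] is an assumed
    weight expressing how strongly color word c points to category k.
    """
    if len(shape_words) != len(color_words):
        raise ValueError("each feature needs one shape word and one color word")

    histograms: List[List[float]] = []
    for class_attention in attention:
        hist = [0.0] * n_shape_words
        for s, c in zip(shape_words, color_words):
            # A feature votes for its shape word with the attention its
            # color word assigns to this category, so features in regions
            # with category-typical colors contribute more.
            hist[s] += class_attention[c]
        histograms.append(hist)
    return histograms


# Usage: four features, three shape words, two color words, two categories.
example = color_attention_histograms(
    shape_words=[0, 1, 1, 2],
    color_words=[0, 0, 1, 1],
    attention=[[0.9, 0.1], [0.1, 0.9]],
    n_shape_words=3,
)
```

In this sketch the same features produce a different histogram for each category, which is one way to read the claim that the representation is category-specific: the shape vocabulary stays shared, while color only decides how much each feature counts.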