We present a method to learn visual attributes (e.g., “red”,
“metal”, “spotted”) and object classes (e.g., “car”,
“dress”, “umbrella”) together. We assume each image is labeled
with the category, but not the location, of an instance. We
estimate models with an iterative procedure: the current
model is used to produce a saliency score, which, together
with a homogeneity cue, identifies likely locations for the
object (resp. attribute); then those locations are used to
produce better models with multiple instance learning. Crucially,
the object and attribute models must agree on the
potential locations of an object, which means the more
accurate of the two models can guide the improvement of
the less accurate one.
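The iterative procedure above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: a least-squares linear scorer stands in for the multiple-instance learning step, random vectors stand in for window features, and the agreement constraint is reduced to a median threshold on each model's combined score. All function names and parameters (`fit_scorer`, `joint_round`, the feature dimensions) are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_scorer(feats, labels):
    # Least-squares linear scorer: a lightweight stand-in for the
    # multiple-instance learning update described in the text.
    X = np.c_[feats, np.ones(len(feats))]  # append a bias column
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    return w

def saliency(w, feats):
    # Saliency score the current model assigns to each candidate window.
    X = np.c_[feats, np.ones(len(feats))]
    return X @ w

def joint_round(windows, homogeneity, w_obj, w_attr):
    # Combine each model's saliency with the shared homogeneity cue.
    s_obj = saliency(w_obj, windows) + homogeneity
    s_attr = saliency(w_attr, windows) + homogeneity
    # Agreement constraint: keep only windows that BOTH models rank
    # highly, so the stronger model can steer the weaker one.
    agree = (s_obj > np.median(s_obj)) & (s_attr > np.median(s_attr))
    y = agree.astype(float)
    # Refit both models treating agreed windows as positives.
    return fit_scorer(windows, y), fit_scorer(windows, y), agree

# Toy data: 40 candidate windows with 3-d features plus a homogeneity cue.
windows = rng.normal(size=(40, 3))
homog = rng.uniform(size=40)
w_obj = rng.normal(size=4)
w_attr = rng.normal(size=4)
for _ in range(3):
    w_obj, w_attr, agree = joint_round(windows, homog, w_obj, w_attr)
print(int(agree.sum()), "windows survive the agreement check")
```

In the real method the positive bags come from image-level labels and the scorers are full detectors; the sketch only shows the alternation between scoring locations under an agreement constraint and refitting both models on the agreed locations.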
Our method is evaluated on two data sets of images of
real scenes, one in which the attribute is color and the other
in which it is material. We show that our joint learning
produces improved detectors. We demonstrate generalization
by detecting a...