Given an image, we propose a hierarchical generative
model that classifies the overall scene, recognizes and segments
each object component, and annotates the image
with a list of tags. To our knowledge, this is the first
model that performs all three tasks in one coherent framework.
For instance, a scene of a ‘polo game’ consists of
several visual objects such as ‘human’, ‘horse’, and ‘grass’.
In addition, it can be further annotated with a list of more
abstract (e.g. ‘dusk’) or visually less salient (e.g. ‘saddle’)
tags. Our generative model jointly explains images through
a visual model and a textual model. Visually relevant objects
are represented by regions and patches, while visually
irrelevant textual annotations are influenced directly
by the overall scene class. We propose a fully automatic
framework that learns robust scene models from noisy
web data, such as images and user tags from Flickr.com.
We demonstrate...
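
To make the generative structure described above concrete, the following is a minimal sketch of the sampling process it implies: a scene class generates an object label for each image region, while more abstract or visually less salient tags are drawn directly from the scene class. All vocabularies, distributions, and parameter values here are illustrative assumptions, not the model's actual components.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical vocabularies and distributions, assumed for illustration.
    scenes = ["polo game", "sailing", "snowboarding"]
    objects_given_scene = {  # p(object | scene): visually relevant objects
        "polo game":    (["human", "horse", "grass"], [0.4, 0.3, 0.3]),
        "sailing":      (["boat", "water", "sky"],    [0.4, 0.4, 0.2]),
        "snowboarding": (["human", "snow", "tree"],   [0.4, 0.4, 0.2]),
    }
    tags_given_scene = {  # p(tag | scene): tags tied directly to the scene
        "polo game":    (["dusk", "saddle"],     [0.5, 0.5]),
        "sailing":      (["breeze", "regatta"],  [0.5, 0.5]),
        "snowboarding": (["alpine", "powder"],   [0.5, 0.5]),
    }

    def generate_image(n_regions=3, n_scene_tags=2):
        """Sample one image description: scene -> region objects + scene tags."""
        c = rng.choice(scenes)                # draw the overall scene class
        objs, p = objects_given_scene[c]
        # one object label per image region, conditioned on the scene
        regions = [rng.choice(objs, p=p) for _ in range(n_regions)]
        tags, q = tags_given_scene[c]
        # visually less salient tags drawn directly from the scene class
        scene_tags = [rng.choice(tags, p=q) for _ in range(n_scene_tags)]
        return c, regions, scene_tags

    print(generate_image())

In the full model, each region's appearance (regions and patches) would also be generated conditioned on its object label; the sketch omits that step and shows only the scene-to-object and scene-to-tag dependencies.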