High-level, or holistic, scene understanding involves
reasoning about objects, regions, and the 3D relationships
between them. This requires a representation above the
level of pixels that can be endowed with high-level attributes
such as the class of an object or region, its orientation, and its
(rough 3D) location within the scene. Towards this goal, we
propose a region-based model which combines appearance
and scene geometry to automatically decompose a scene
into semantically meaningful regions. Our model is defined
in terms of a unified energy function over scene appearance
and structure. We show how this energy function can be
learned from data and present an efficient inference technique
that makes use of multiple over-segmentations of the
image to propose moves in the energy space. We show, experimentally,
that our method achieves state-of-the-art performance
on the tasks of both multi-class image segmentation
and geometric reasoning. Finally, by understanding
region classes...
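The abstract leaves the precise form of the energy unspecified; as an illustrative sketch only (assumed notation, not the paper's exact formulation), a region-based energy coupling appearance and geometry might be written as

E(R) = \sum_{r \in R} \psi^{\mathrm{app}}(c_r, A_r) + \sum_{r \in R} \psi^{\mathrm{geom}}(g_r, v_{\mathrm{hz}}) + \sum_{(r,s)\,\mathrm{adjacent}} \psi^{\mathrm{pair}}(c_r, c_s),

where R is a candidate decomposition of the image into regions, c_r and g_r denote the semantic class and surface geometry of region r, A_r its aggregated appearance features, v_{hz} a horizon estimate, and the pairwise term encourages consistent labels across adjacent regions. Inference then searches over decompositions R for a minimizer of E.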
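The move-making inference described above can likewise be sketched as a simple greedy procedure: segments drawn from several over-segmentations propose reassignments of pixels to regions, and a proposal is accepted only if it lowers the energy. The function and argument names below (infer_regions, energy, oversegmentations) are hypothetical, and the move set is deliberately reduced to segment reassignment; a full system would use a richer move set and energy terms than this sketch shows.

import numpy as np

def infer_regions(features, oversegmentations, energy, init_labels, n_passes=3):
    """Greedy move-making over region decompositions (illustrative sketch).

    features: (H, W, D) array of per-pixel appearance features
    oversegmentations: list of (H, W) integer segment maps at varying granularity
    energy: callable(features, labels) -> float, lower is better
    init_labels: (H, W) integer array assigning each pixel to a region
    """
    labels = init_labels.copy()
    best = energy(features, labels)
    for _ in range(n_passes):
        improved = False
        for segmap in oversegmentations:
            for seg_id in np.unique(segmap):
                mask = segmap == seg_id
                # Candidate move: give the whole segment to the region that
                # already claims most of its pixels (a merge/refine move).
                target = np.bincount(labels[mask]).argmax()
                if np.all(labels[mask] == target):
                    continue
                candidate = labels.copy()
                candidate[mask] = target
                score = energy(features, candidate)
                if score < best:  # accept only energy-decreasing moves
                    labels, best = candidate, score
                    improved = True
        if not improved:
            break
    return labels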