Combining high-level and low-level information has long been recognized as essential to holistic image understanding. However, efficient inference in multi-level models remains an open problem. Moreover, modeling the complex relations within real-world images often gives rise to energy terms that couple many variables in arbitrary ways, which makes inference even harder. In this paper, we construct an energy function over the pose of the human body and a pixel-wise foreground/background segmentation. The energy function incorporates terms at both the higher level, which models human pose, and the lower level, which models the pixels. It also contains an intractable term that couples all body parts. We show how to optimize this energy in a principled way by relaxed dual decomposition, which proceeds by maximizing a concave lower bound on the energy function. Empirically, we show that our approach improves the state-of-the-art per...