Recent work in object localization has shown that contextual cues can greatly improve accuracy over models that use appearance features alone. Although many of these models have successfully explored different types of contextual sources, they consider only one type of contextual interaction (e.g., pixel, region or object level interactions), leaving open questions about the true potential contribution of context. Furthermore, the contributions of context across object classes and over appearance features remain unknown. In this work, we introduce a novel model for multiclass object localization that incorporates different levels of contextual interactions. We study contextual interactions at the pixel, region and object levels by using three different sources of context: semantic, boundary support and contextual neighborhoods. Our framework learns a single similarity metric from multiple kernels, combining pixel and region interactions with appearance features, and then uses a condition...