This work is about scene interpretation in the sense of detecting and localizing instances from multiple object classes. We concentrate on object indexing: generate an over-complete interpretation ? a list with extra detections but none missed. Pruning such an index to a final interpretation involves a global, often intensive, contextual analysis. We propose a tree-structured hierarchy as a framework for indexing; each node represents a subset of interpretations. This unifies object representation, scene parsing, and sequential learning (modifying the hierarchy as new samples, poses and classes are encountered). Then we specialize to learning - designing and refining a binary classifier at each node of the hierarchy dedicated to the corresponding subset of interpretations. The whole procedure is illustrated by experiments in reading license plates.