In this paper we introduce visual phrases, complex visual composites like “a person riding a horse”. Visual phrases often display significantly reduced visual complexity compared to their component objects, because the appearance of those objects can change profoundly when they participate in relations. We introduce a dataset suitable for phrasal recognition that uses familiar PASCAL object categories, and demonstrate significant experimental gains resulting from exploiting visual phrases. We show that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects. We argue that any multi-class detection system must decode detector outputs to produce final results; this is usually done with nonmaximum suppression. We describe a novel decoding procedure that can account accurately for local context without solving difficult inference pro...