We propose a multi-view stereo reconstruction algorithm that recovers urban scenes as a combination of meshes and geometric primitives. The resulting model is compact yet preserves details: irregular elements such as statues and ornaments are described by meshes, whereas regular structures such as columns and walls are described by primitives (planes, spheres, cylinders, cones and tori). A Jump-Diffusion process is designed to sample these two types of elements simultaneously. The quality of a reconstruction is measured by a multi-object energy model that takes into account both photo-consistency and semantic considerations (i.e., geometry and shape layout). The sampler is embedded into an iterative refinement procedure that provides an increasingly accurate hybrid representation. Experimental results on complex urban structures and large scenes are presented and compared to multi-view-based meshing algorithms.
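
To make the sampling idea concrete, the following is a minimal toy sketch of a generic jump-diffusion sampler with a cooling temperature. It is not the sampler or the energy used in this work: the one-dimensional state, the two quadratic energy wells standing in for "mesh-like" and "primitive-like" descriptions, and all names and parameter values (`energy`, `grad_energy`, `jump_diffusion`, `p_jump`, `step`, `cooling`) are assumptions chosen purely for illustration.

```python
import math
import random


def energy(kind, x):
    # Hypothetical toy energy (an assumption, not the paper's model):
    # kind 0 stands in for a mesh-like description, kind 1 for a primitive.
    if kind == 0:
        return (x - 2.0) ** 2 + 1.0  # minimum value 1.0 at x = 2
    return 0.5 * (x + 1.0) ** 2      # minimum value 0.0 at x = -1


def grad_energy(kind, x):
    # Derivative of the toy energy with respect to the continuous parameter x.
    if kind == 0:
        return 2.0 * (x - 2.0)
    return x + 1.0


def jump_diffusion(steps=5000, p_jump=0.1, step=0.05,
                   t0=1.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    kind, x = 0, 5.0          # arbitrary initial configuration
    temperature = t0
    best = (energy(kind, x), kind, x)
    for _ in range(steps):
        if rng.random() < p_jump:
            # Jump: propose switching the discrete type while keeping x
            # (a symmetric proposal), accepted with the Metropolis rule.
            new_kind = 1 - kind
            delta = energy(new_kind, x) - energy(kind, x)
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                kind = new_kind
        else:
            # Diffusion: one Langevin step on the continuous parameter,
            # i.e. gradient descent plus temperature-scaled Gaussian noise.
            noise = rng.gauss(0.0, 1.0)
            x += -step * grad_energy(kind, x) \
                + math.sqrt(2.0 * step * temperature) * noise
        temperature *= cooling  # geometric cooling schedule
        e = energy(kind, x)
        if e < best[0]:
            best = (e, kind, x)
    return best


if __name__ == "__main__":
    best_energy, best_kind, best_x = jump_diffusion()
    print(best_energy, best_kind, best_x)
```

In this sketch the jump moves change which kind of element describes the state, while the diffusion moves refine its continuous parameter. A real hybrid reconstruction would replace the scalar state with a configuration of many meshes and primitives and the toy energy with photo-consistency and layout terms.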