In this paper we address multi-view reconstruction of urban environments using 3D shape grammars. Our formulation expresses the solution to the problem as a shape grammar parse tree where both the tree and the corresponding derivation parameters are unknown. Besides the grammar constraint, the solution is guided by a twofold image support. First, we seek a derivation that induces optimal semantic partitions in the different views. Second, using structure-from-motion, we compute noisy depth maps and minimize their distance to those predicted by any candidate solution. We show how the underlying data structure can be efficiently optimized using evolutionary algorithms with automatic parameter selection. To the best of our knowledge, this is the first time the multi-view 3D procedural modeling problem has been tackled. Promising results demonstrate the potential of the method for producing a compact representation of urban environments.
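To make the optimization scheme concrete, the following is a minimal, hypothetical sketch of an elitist evolutionary search over derivation parameters. The fitness function here is a toy stand-in: the two quadratic/absolute-error terms merely play the roles of the semantic-partition score and the depth-map distance described above, and all names (`TARGET`, `fitness`, `evolve`) and numeric settings are illustrative assumptions, not the paper's actual formulation.

```python
import random

# Hypothetical stand-in: a candidate is a small vector of derivation
# parameters for a fixed parse tree; the real method scores candidates
# with a semantic-segmentation term and a depth-map distance term.
TARGET = [0.4, 1.2, 0.7]  # assumed "true" parameters (illustration only)

def fitness(params):
    # Toy surrogate: quadratic term mimics the semantic score,
    # absolute term mimics the depth-map distance. Lower is better.
    semantic = sum((p - t) ** 2 for p, t in zip(params, TARGET))
    depth = sum(abs(p - t) for p, t in zip(params, TARGET))
    return semantic + 0.5 * depth

def evolve(pop_size=20, generations=200, sigma=0.3, seed=0):
    rng = random.Random(seed)
    # Initial population: random parameter vectors in [0, 2].
    pop = [[rng.uniform(0.0, 2.0) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]   # truncation selection (elitist)
        children = [
            [p + rng.gauss(0.0, sigma) for p in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children         # (mu + lambda) replacement
    return min(pop, key=fitness)

best = evolve()
```

Because the parents survive each generation, the best fitness is monotonically non-increasing; in the actual method, mutation would also alter the parse tree itself, and the step size `sigma` would be adapted automatically rather than fixed.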