We study unsupervised learning of occluding objects in images of visual scenes. The derived learning algorithm is based on a probabilistic generative model which parameterizes object shapes, object features and the background. No assumptions are made for the object orders in depth or the objects’ planar positions. Parameter optimization is thus subject to the large combinatorics of depth orders and positions. Previous approaches constrained this combinatorics but were still only able to learn a very small number of objects. By applying a novel variational EM approach, we show that even without constraints on the object combinatorics, a relatively large number of objects can be learned. In different numerical experiments, our unsupervised approach extracts explicit object representations with object masks and object features closely aligned with the true objects in the scenes. We investigate the robustness of the approach and the use of the learned representations for inference. Furt...