The automatic creation of 3D models of urban spaces has become a very active field of research, inspired by recent location-aware applications on the Internet such as maps.live.com and similar websites. The level of automation in creating 3D city models has increased considerably and has benefited from greater redundancy in the source imagery, namely digital aerial photography. In this paper we argue that the next big step forward is to replace photographic texture with an interpretation of what the texture depicts, and to achieve this fully automatically. The result is called "semantic knowledge". For example, we want to know that a certain part of the image is a car, a person, a building, a tree, a shrub, a window, or a door, rather than just a collection of 3D points or triangles with a superimposed photographic texture. We investigate object recognition methods as a means of taking this step. We demonstrate an early result of using...