Abstract

This paper presents a knowledge-driven joint inference approach to designing and implementing extensible computational models for perceiving systems. These models can integrate different sources of information both horizontally (multi-modal and temporal fusion) and vertically (bottom-up, top-down) by incorporating prior hierarchical knowledge expressed as an extensible ontology. Two implementations of this approach are presented. The first is a content-based image retrieval system that allows users to search image databases using an ontological query language. Queries are parsed using a probabilistic grammar and Bayesian networks to map high-level concepts onto low-level image descriptors, thereby bridging the "semantic gap" between users and the retrieval system. The second application extends the notion of ontological languages to video event detection. It is shown how effective high-level state and event recognition mechanisms can be ...
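To make the concept-to-descriptor mapping concrete, the following sketch uses a naive-Bayes simplification of the Bayesian-network inference the abstract describes: a high-level concept is scored against binary low-level image descriptors via Bayes' rule. The concept name ("grass"), the descriptor variables, and all probability values are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch (hypothetical values): naive-Bayes inference of a
# high-level concept from binary low-level image descriptors.
# Neither the descriptor names nor the probabilities come from the paper.

# Prior probability that an image region depicts the concept "grass".
P_CONCEPT = 0.2

# For each descriptor: (P(descriptor=True | concept), P(descriptor=True | not concept))
LIKELIHOODS = {
    "dominant_green_hue": (0.90, 0.20),
    "fine_texture":       (0.80, 0.30),
    "low_edge_density":   (0.70, 0.40),
}

def posterior(observed: dict) -> float:
    """Return P(concept | observed descriptors) via Bayes' rule,
    assuming descriptors are conditionally independent given the concept."""
    p_yes, p_no = P_CONCEPT, 1.0 - P_CONCEPT
    for name, present in observed.items():
        p_true, p_false = LIKELIHOODS[name]
        p_yes *= p_true if present else 1.0 - p_true
        p_no  *= p_false if present else 1.0 - p_false
    return p_yes / (p_yes + p_no)

if __name__ == "__main__":
    # A region with a green hue and fine texture but high edge density.
    obs = {"dominant_green_hue": True,
           "fine_texture": True,
           "low_edge_density": False}
    print(f"P(grass | descriptors) = {posterior(obs):.3f}")
```

In an ontological query language of the kind described, such posteriors would let a parsed query term (e.g., "grass") be grounded in measurable image evidence; a full Bayesian network would additionally model dependencies among descriptors rather than assuming independence.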