This work proposes a hierarchical architecture for learning dynamic scenes at several levels of knowledge abstraction. The raw visual information is processed in stages to generate hybrid symbolic/sub-symbolic descriptions of the scene, agents and events. The background is learned incrementally at the lowest layer, and the resulting background model is used at the mid-level for multi-agent tracking with symbolic reasoning. Agent and event discovery is performed at the next higher layer by processing the agent features, status history and trajectories. Unlike existing vision systems, the proposed algorithm does not assume any prior information and aims to learn the scene, agent and event models from the acquired images alone. This makes it a versatile vision system that can operate in a wide variety of environments.
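To make the lowest layer concrete, the following is a minimal sketch of incremental background learning, assuming a simple running-average model. The abstract does not state which background model is used, so the update rule, the function names `update_background` and `foreground_mask`, and the parameters `alpha` and `threshold` are illustrative assumptions rather than the method of this work.

```python
from typing import List

# A grayscale frame as rows of pixel intensities (illustrative representation).
Frame = List[List[float]]


def update_background(background: Frame, frame: Frame, alpha: float = 0.05) -> Frame:
    """Blend the new frame into the background estimate.

    alpha is an assumed learning rate: larger values adapt faster
    but absorb slow-moving agents into the background sooner.
    """
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background: Frame, frame: Frame, threshold: float = 30.0) -> List[List[bool]]:
    """Mark pixels whose intensity differs from the background by more than threshold."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Usage: a 2x2 scene in which one pixel changes sharply.
background = [[10.0, 10.0], [10.0, 10.0]]
frame = [[10.0, 200.0], [10.0, 10.0]]
mask = foreground_mask(background, frame)
updated = update_background(background, frame, alpha=0.5)
```

Under this assumption, the foreground mask is what a mid-level tracker would consume as candidate agent regions, while the updated background carries the scene model forward to the next frame.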