We propose a new approach for video event learning. The only assumption is the availability of tracked object attributes. The approach incrementally aggregates the attributes and reliability information of tracked objects to learn a hierarchy of state and event concepts. Simultaneously, it recognises the states and events of the tracked objects. The approach thus provides an automatic bridge between low-level image data and higher-level conceptual information. It has been evaluated on more than two hours of video from an elderly care application. The results show that the approach can learn and recognise meaningful events occurring in the scene. They also show its potential for describing the activities of a person (e.g. approaching a table, crouching) and for detecting abnormal events based on their frequency of occurrence.