Several stochastic models provide an effective framework for identifying the temporal structure of audiovisual data. Most of them require an initial video structure as input, i.e. the connections between features and video events; given this structure, the model parameters are then estimated from training data. Bayesian networks offer an additional capability, structure learning, which constructs the model structure automatically from training data. Structure learning thus increases the generality of the model-building process. This paper investigates the trade-off between this increase in generality and the quality of the results in video analysis. We model video data using dynamic Bayesian networks (DBNs), where the static part of the network accounts for the correlations between low-level features extracted from the raw data, and between these features and the events considered. It is precisely this part of the network whose structure is automatically learned from training data.
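
To make the structure-learning step concrete, the sketch below shows a minimal score-based (BIC) greedy edge-addition search over the static part of such a network, assuming discretized low-level features and a binary event variable. The variable names ("motion", "audio", "event"), the choice of BIC, and the edge-addition-only search are illustrative assumptions; the abstract does not specify the actual scoring function or search procedure used in the paper.

```python
# Minimal sketch of score-based structure learning for the static part of a
# DBN over discretized features. All variable names and the toy data are
# hypothetical; this is not the paper's actual algorithm.
import itertools
import math
from collections import Counter

import numpy as np


def bic_score(data, var, parents, cards):
    """BIC score of one variable given a candidate parent set.

    data: list of dicts mapping variable name -> discrete value.
    cards: dict mapping variable name -> cardinality.
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[var]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / marg[pa]) for (pa, _), c in joint.items())
    n_params = (cards[var] - 1) * math.prod(cards[p] for p in parents)
    return loglik - 0.5 * math.log(n) * n_params


def creates_cycle(parents, u, v):
    """Would adding the edge u -> v create a directed cycle?"""
    # A cycle appears iff u is already reachable from v.
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for child, ps in parents.items() if node in ps)
    return False


def hill_climb(data, variables, cards):
    """Greedy edge addition: repeatedly add the single edge that most
    improves the BIC score, keeping the graph acyclic, until no addition
    helps. Returns a dict mapping each variable to its learned parent set."""
    parents = {v: set() for v in variables}
    score = {v: bic_score(data, v, parents[v], cards) for v in variables}
    improved = True
    while improved:
        improved = False
        best_gain, best_edge = 0.0, None
        for u, v in itertools.permutations(variables, 2):
            if u in parents[v] or creates_cycle(parents, u, v):
                continue
            gain = bic_score(data, v, parents[v] | {u}, cards) - score[v]
            if gain > best_gain:
                best_gain, best_edge = gain, (u, v)
        if best_edge is not None:
            u, v = best_edge
            parents[v].add(u)
            score[v] = bic_score(data, v, parents[v], cards)
            improved = True
    return parents


if __name__ == "__main__":
    # Toy synthetic data: the event tends to co-occur with high motion,
    # while the audio feature is independent noise.
    rng = np.random.default_rng(0)
    motion = rng.integers(0, 2, size=500)
    audio = rng.integers(0, 3, size=500)
    event = (motion & (rng.random(500) < 0.8)).astype(int)
    data = [{"motion": m, "audio": a, "event": e}
            for m, a, e in zip(motion, audio, event)]
    cards = {"motion": 2, "audio": 3, "event": 2}
    print(hill_climb(data, ["motion", "audio", "event"], cards))
```

On this toy data the search typically recovers an edge between "motion" and "event" and leaves "audio" disconnected, illustrating how the structure of the static part can be inferred from training data rather than specified by hand.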