The ambiguity inherent in a localized analysis of events from video can be resolved by exploiting constraints between events and examining only feasible global explanations. We show how jointly recognizing and linking events can be formulated as labeling of a Bayesian network. The framework can be extended to multiple linking layers, expressing explanations as compositional hierarchies. The best global explanation is the Maximum a Posteriori (MAP) solution over a set of feasible explanations. The search space is sampled using Reversible Jump Markov Chain Monte Carlo (RJMCMC). We propose a set of general move types that is extensible to multiple layers of linkage, and use simulated annealing to find the MAP solution given all observations. We provide experimental results for a challenging two-layer linkage problem, demonstrating the ability to recognise and link drop and pick events of bicycles in a rack over five days.