Recognizing multiple interleaved activities in a video requires implicitly partitioning the detections for each activity. Furthermore, constraints between activities are important in finding valid explanations for all detections. We use Attribute Multiset Grammars (AMGs) as a formal representation for a domain’s knowledge to encode intra- and inter-activity constraints. We show how AMGs can be used to parse all the observations into ‘feasible’ global explanations. We also present an algorithm for building a Bayesian network (BN) given an AMG and a set of detections. The set of labellings of the BN corresponds to the set of all possible parse trees. Finding the best explanation then amounts to finding the maximum a posteriori labeling of the BN. The technique is successfully applied to two different problems including the challenging problem of associating pedestrians and carried objects entering and departing a building.