Analyzing videos of human activities involves not only
recognizing actions (typically based on their appearances),
but also determining the story/plot of the video. The storyline
of a video describes causal relationships between actions.
Beyond recognition of individual actions, discovering
causal relationships helps to better understand the semantic
meaning of the activities. We present an approach to learn
a visually grounded storyline model of videos directly from
weakly labeled data. The storyline model is represented as
an AND-OR graph, a structure that can compactly encode
storyline variation across videos. The edges in the AND-OR
graph correspond to causal relationships, which are represented
in terms of spatio-temporal constraints. We formulate
an Integer Programming framework for action recognition
and storyline extraction using the storyline model and
visual groundings learned from training data.
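
To illustrate how an AND-OR graph can compactly encode storyline
variation, the following is a minimal sketch and not the model
described above. The Node class, the storylines function, and the
action labels are hypothetical names introduced only for this example,
and the spatio-temporal constraints on edges are omitted. An AND node
requires all of its children in order, while an OR node selects one
alternative.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List


@dataclass
class Node:
    """Hypothetical AND-OR graph node; kind is "AND", "OR", or "ACTION"."""
    kind: str
    label: str = ""
    children: List["Node"] = field(default_factory=list)


def action(label: str) -> Node:
    """Build a leaf node standing for a single action."""
    return Node("ACTION", label)


def storylines(node: Node) -> List[List[str]]:
    """Enumerate every action sequence encoded below this node."""
    if node.kind == "ACTION":
        return [[node.label]]
    if node.kind == "OR":
        # An OR node contributes the storylines of exactly one child.
        return [s for child in node.children for s in storylines(child)]
    # An AND node concatenates one storyline from each child, in order.
    combos = product(*(storylines(child) for child in node.children))
    return [[a for part in combo for a in part] for combo in combos]


# Toy graph with invented labels: a pitch followed by either
# (hit, then run) or a miss.
graph = Node("AND", children=[
    action("pitch"),
    Node("OR", children=[
        Node("AND", children=[action("hit"), action("run")]),
        action("miss"),
    ]),
])

for line in storylines(graph):
    print(" -> ".join(line))
```

A single graph with five action and composite nodes here stands for
two distinct storylines; adding further OR alternatives grows the set
of encoded storylines without duplicating the shared prefix.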