This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision. To this end we consider two associated problems: (a) weakly-supervised learning of action models from readily available annotations, and (b) temporal localization of human actions in test videos. To avoid the prohibitive cost of manual annotation for training, we use movie scripts as a means of weak supervision. Scripts, however, provide only implicit, noisy, and imprecise information about the type and location of actions in video. We address this problem with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data. Using the obtained action samples, we train temporal action detectors and apply them to locate actions in the raw video data. Our experiments demonstrate that the proposed method for weakly-supervised learning of action models leads to significant improvement i...
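The discriminative clustering step can be pictured with a minimal alternating-optimization sketch. Everything below is illustrative rather than the paper's exact formulation: the function name and parameters are hypothetical, a linear ridge-regression (least-squares) cost stands in for the kernelized objective, each clip is initialized with its middle window, and all non-selected windows are treated as negatives.

```python
import numpy as np

def discriminative_clustering(clips, lam=1.0, n_iters=10):
    """Pick one positive temporal window per weakly-labeled clip by alternating
    between fitting a linear least-squares classifier and reassigning, for each
    clip, the window that scores highest under the current model.

    clips: list of arrays, each of shape (n_windows_i, d), with one feature
           vector per candidate temporal window of a training clip.
    Returns the index of the selected window for every clip.
    """
    # Initialize by selecting the middle window of each clip (an assumption).
    selected = [c.shape[0] // 2 for c in clips]
    d = clips[0].shape[1]

    for _ in range(n_iters):
        # Build the labeled set: selected windows are positives,
        # all remaining windows are negatives (a simplifying assumption).
        X, y = [], []
        for clip, s in zip(clips, selected):
            for j in range(clip.shape[0]):
                X.append(clip[j])
                y.append(1.0 if j == s else -1.0)
        X, y = np.asarray(X), np.asarray(y)

        # Ridge regression in place of the paper's kernel-based objective.
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

        # Reassignment: each clip keeps its best-scoring window as positive.
        new_selected = [int(np.argmax(clip @ w)) for clip in clips]
        if new_selected == selected:
            break
        selected = new_selected
    return selected
```

The alternation between classifier fitting and window reassignment is the core idea: the discriminative cost drives the selected windows of different clips of the same action toward a consistent cluster.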
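Once action samples are obtained and a detector is trained, temporal localization in raw test video amounts to scoring sliding windows. The sketch below is a generic illustration under stated assumptions: per-frame features are average-pooled into a window descriptor, a linear detector `w` scores each window, and a fixed threshold selects detections; the descriptor, window length, and thresholding are placeholders, not the paper's actual detector.

```python
import numpy as np

def detect_actions(video_features, w, window_len=120, stride=30, thresh=0.0):
    """Slide a fixed-length temporal window over a test video, score each
    window with a linear detector w, and keep windows above a threshold.

    video_features: array of shape (n_frames, d), one feature vector per frame.
    Returns a list of (start_frame, end_frame, score) detections.
    """
    detections = []
    n_frames = video_features.shape[0]
    for start in range(0, n_frames - window_len + 1, stride):
        # Average-pool frame features into one window descriptor
        # (an illustrative choice, not the paper's representation).
        descriptor = video_features[start:start + window_len].mean(axis=0)
        score = float(descriptor @ w)
        if score > thresh:
            detections.append((start, start + window_len, score))
    return detections
```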