We address the character identification problem in
movies and television videos: assigning names to faces on
the screen. Most prior work on person recognition in video
assumes some supervised data such as a screenplay or hand-labeled
faces. In this paper, our only source of ‘supervision’
is dialog cues: first-, second- and third-person
references (such as “I’m Jack”, “Hey, Jack!” and “Jack
left”). While this kind of supervision is sparse and indirect,
we exploit multiple modalities and their interactions (appearance,
dialog, mouth movement, synchrony, continuity-editing
cues) to effectively resolve identities through local
temporal grouping followed by global weakly supervised
recognition. We propose a novel temporal grouping model
that partitions face tracks across multiple shots while respecting
appearance, geometric and film-editing cues and
constraints. In this model, states represent partitions of the
k most recent face tracks, and transitions repr...