This paper presents a method for learning decision-theoretic models of facial expressions and gestures from video data. We consider that the meaning of a facial display or gesture to an observer lies in its relationship to context, actions, and outcomes. An agent wishing to capitalize on these relationships must distinguish facial displays and gestures according to their affordances, that is, how they help the agent maximize utility. This paper demonstrates how an agent can learn relationships between unlabeled observations of a person's face and gestures, the context, and its own actions and utility function. The agent needs no prior knowledge about the number or structure of the gestures and facial displays that are valuable to distinguish. The agent discovers classes of human non-verbal behaviors, as well as which of them are important for choosing actions that maximize the expected utility of possible outcomes. This value-directed model learning allows an agent to focus resources ...
Jesse Hoey, James J. Little
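To make the notion of value-directed action selection concrete, the following is a minimal sketch, not the paper's actual model: it assumes that unsupervised learning has already grouped displays into a few classes and estimated outcome distributions per class and action, and it simply picks the action with the highest expected utility given a belief over the discovered classes. All names (DISPLAY_CLASSES, P_OUTCOME, UTILITY, choose_action) and the numbers are illustrative assumptions.

```python
# Hypothetical sketch: expected-utility action selection over discovered
# display classes. Quantities below are made up for illustration; in the
# paper they would be learned from unlabeled video, context, and reward.
import numpy as np

DISPLAY_CLASSES = ["class_0", "class_1"]   # discovered, unlabeled clusters
ACTIONS = ["act_a", "act_b"]
OUTCOMES = ["outcome_good", "outcome_bad"]

# P(outcome | display class, action): shape (display, action, outcome)
P_OUTCOME = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # class_0
    [[0.4, 0.6], [0.9, 0.1]],   # class_1
])

UTILITY = np.array([1.0, -1.0])  # utility assigned to each outcome

def choose_action(belief_over_displays):
    """Pick the action with highest expected utility, given the agent's
    belief over which discovered display class it has just observed."""
    b = np.asarray(belief_over_displays)                  # shape (display,)
    expected_outcomes = np.einsum("d,dao->ao", b, P_OUTCOME)
    expected_utility = expected_outcomes @ UTILITY        # shape (action,)
    return ACTIONS[int(np.argmax(expected_utility))], expected_utility

# Example: the agent is fairly confident it observed a class_1 display.
action, eu = choose_action([0.2, 0.8])
print(action, eu)
```

Under this toy setup, a display class only matters to the agent insofar as it changes which action maximizes expected utility, which is the sense in which classes are distinguished by their affordances rather than by appearance alone.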