In this paper, we investigate the detection of semantic
human actions in complex scenes. Unlike conventional
action recognition in well-controlled environments,
action detection in complex scenes suffers from cluttered
backgrounds, heavy crowds, occluded bodies, and spatial-temporal
boundary ambiguities caused by imperfect human
detection and tracking. Conventional algorithms are
likely to fail under such spatial-temporal ambiguities. In this
work, the candidate regions of an action are treated as a
bag of instances. Then a novel multiple-instance learning
framework, named SMILE-SVM (Simulated annealing Multiple
Instance LEarning Support Vector Machines), is presented
for learning a human action detector from imprecise
action locations. SMILE-SVM is evaluated extensively,
with satisfactory performance, on two tasks: 1) human action
detection on a public video action database with cluttered
backgrounds, and 2) a real-world problem of detecting
whether the customers in a s...