We present an automatic and efficient method to extract spatio-temporal human volumes from video, combining top-down model-based and bottom-up appearance-based approaches. From the top-down perspective, our algorithm applies shape priors probabilistically to candidate image regions obtained by pedestrian detection, yielding accurate estimates of the human body areas that serve as important constraints for bottom-up processing. Temporal propagation of the identified region is performed with bottom-up cues in an efficient level-set framework that exploits the sparse top-down information available. Our formulation also optimizes the extracted human volume across frames through belief propagation, producing temporally coherent human regions. We demonstrate that our method extracts human body regions efficiently and automatically on a large, challenging dataset collected from YouTube.