This paper presents a novel motion localization approach for recognizing actions and events in real videos, such as StandUp and Kiss in Hollywood movies. The challenge stems from the large visual and motion variations imposed by realistic action poses. Previous work mainly learns from descriptors of cuboids around space-time interest points (STIP) to characterize actions. The size, shape, and space-time position of these cuboids are fixed without considering the underlying motion dynamics, which often results in a large set of fragmented cuboids that fail to capture the long-term dynamic properties of realistic actions. This paper proposes detecting spatio-temporal motion volumes, termed Volumes of Interest (VOIs), with adaptive scale and position to localize actions. First, motions are described as bags of point trajectories obtained by tracking keypoints along the time dimension. VOIs are then adaptively extracted by clustering trajectories on the motion manifold. The re...