Visual categorization problems, such as object classification or action recognition,
are increasingly approached using a detection strategy: a classifier function
is first applied to candidate subwindows of the image or the video, and then the
maximum classifier score is used for the class decision. Traditionally, the subwindow
classifiers are trained on a large collection of examples manually annotated with
masks or bounding boxes. The reliance on time-consuming human labeling effectively
limits the application of these methods to problems involving very few
categories. Furthermore, the human selection of the masks introduces arbitrary
biases (e.g., in window size and location) that may be suboptimal for
classification.
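
The detection strategy described above (score every candidate subwindow, then decide from the maximum score) can be sketched as follows. This is only an illustrative sketch, not the method proposed in this report: the fixed-size sliding windows, the decision threshold, and the names candidate_windows, classify_by_detection, and toy_score are all assumptions introduced for the example.

```python
import numpy as np

def candidate_windows(height, width, size, stride):
    """Yield (top, left, bottom, right) for square subwindows of a fixed size."""
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield top, left, top + size, left + size

def classify_by_detection(image, score_fn, size=2, stride=1, threshold=0.0):
    """Score every candidate subwindow and decide the class from the maximum."""
    best_score, best_window = -np.inf, None
    for top, left, bottom, right in candidate_windows(*image.shape, size, stride):
        score = score_fn(image[top:bottom, left:right])
        if score > best_score:
            best_score, best_window = score, (top, left, bottom, right)
    # The image-level decision depends only on the highest-scoring subwindow.
    return bool(best_score > threshold), float(best_score), best_window

def toy_score(patch):
    """Hypothetical subwindow classifier: responds to bright regions."""
    return patch.mean() - 0.5

# Toy image with one bright 2x2 region standing in for the object of interest.
image = np.zeros((4, 4))
image[1:3, 2:4] = 1.0
is_present, score, window = classify_by_detection(image, toy_score)
```

In this toy setting the highest-scoring subwindow coincides with the bright region, so the image-level decision and a location estimate come from the same maximization. In practice the subwindow classifier would be learned rather than hand-written.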
In this report, we propose a novel method for learning a discriminative subwindow
classifier from examples annotated with binary labels indicating the presence
of an object or action of interest, but not its location. During training, our approach...