Several recently proposed architectures for high-performance
object recognition are composed of two main
stages: a feature extraction stage that extracts locally invariant
feature vectors from regularly spaced image
patches, and a somewhat generic supervised classifier.
The first stage is often composed of three main modules:
(1) a bank of filters (often oriented edge detectors); (2)
a non-linear transform, such as a point-wise squashing
function, quantization, or normalization; (3) a spatial
pooling operation which combines the outputs of similar
filters over neighboring regions. We propose a method
that automatically learns such feature extractors in an
unsupervised fashion by simultaneously learning the filters
and the pooling units that combine multiple filter outputs
together. The method automatically generates topographic
maps of similar filters that extract features of orientations,
scales, and positions. These similar filters are pooled
together, producing local...
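
As an illustration of the three-module feature extraction stage described above, the following is a minimal sketch in plain Python. It uses two hand-written 2x2 edge filters rather than learned ones, assumes tanh as the squashing function, and assumes L2 pooling over a group of filters and non-overlapping spatial regions. The function names and the toy image are hypothetical and are not taken from the method proposed here; in particular, the sketch fixes the filters and pooling groups by hand, whereas the proposed method learns both.

```python
import math


def correlate2d_valid(image, kernel):
    """Module 1: slide one filter over the image (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def squash(feature_map):
    """Module 2: point-wise squashing non-linearity (tanh is an assumption)."""
    return [[math.tanh(v) for v in row] for row in feature_map]


def pool(feature_maps, size):
    """Module 3: L2-pool a group of filter outputs over size x size regions.

    Pooling runs jointly over all maps in the group and over neighboring
    positions, using non-overlapping regions (an assumed choice).
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            math.sqrt(sum(fm[r + i][c + j] ** 2
                          for fm in feature_maps
                          for i in range(size) for j in range(size)))
            for c in range(0, w - size + 1, size)
        ]
        for r in range(0, h - size + 1, size)
    ]


def extract_features(image, filter_group, pool_size=2):
    """Chain the three modules: filter bank, non-linearity, pooling."""
    maps = [squash(correlate2d_valid(image, k)) for k in filter_group]
    return pool(maps, pool_size)


# Hypothetical hand-written filters standing in for a learned filter group.
VERTICAL_EDGE = [[-1.0, 1.0], [-1.0, 1.0]]
HORIZONTAL_EDGE = [[-1.0, -1.0], [1.0, 1.0]]

# Toy 5x5 image with a vertical step edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]

features = extract_features(image, [VERTICAL_EDGE, HORIZONTAL_EDGE])
```

In this toy case the pooled output is a 2x2 map whose left column responds to the step edge and whose right column is zero. Shifting the edge by one pixel within a pooling region leaves the pooled value unchanged, which is the kind of local invariance the pooling stage is meant to provide.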