In many recent object recognition systems, feature extraction
stages are generally composed of a filter bank, a
non-linear transformation, and some sort of feature pooling
layer. Most systems use only one stage of feature extraction
in which the filters are hard-wired, or two stages where
the filters in one or both stages are learned in supervised
or unsupervised mode. This paper addresses three questions:
1. How do the non-linearities that follow the filter
banks influence recognition accuracy? 2. Does learning
the filter banks in an unsupervised or supervised manner
improve the performance over random filters or hard-wired
filters? 3. Is there any advantage to using an architecture
with two stages of feature extraction, rather than
one? We show that using non-linearities that include rectification
and local contrast normalization is the single most
important ingredient for good accuracy on object recognition
benchmarks. We show that two stages of feature extracti...
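
As an illustration of the stage structure described above (filter bank, non-linearity, pooling), the following is a minimal NumPy sketch, not the implementation evaluated in this paper. Several choices are assumptions made only for the example: the filters are random, rectification is an absolute value, the contrast normalization is a simplified per-map version using a square box window, and pooling is non-overlapping averaging. All function names are hypothetical.

```python
import numpy as np

def filter_bank(image, filters):
    """Valid 2-D cross-correlation of one image with each k x k filter."""
    k = filters.shape[1]
    h, w = image.shape
    out = np.empty((filters.shape[0], h - k + 1, w - k + 1))
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = image[i:i + k, j:j + k]
            # One response per filter at position (i, j).
            out[:, i, j] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))
    return out

def rectify(maps):
    """Absolute-value rectification (assumed form of the rectifier)."""
    return np.abs(maps)

def box_mean(maps, size):
    """Mean over a size x size window around each position (edge-padded, odd size)."""
    pad = size // 2
    padded = np.pad(maps, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, w = maps.shape[1:]
    out = np.zeros_like(maps)
    for di in range(size):
        for dj in range(size):
            out += padded[:, di:di + h, dj:dj + w]
    return out / (size * size)

def local_contrast_normalize(maps, size=3, eps=1e-6):
    """Simplified per-map normalization: subtract local mean, divide by local std."""
    centered = maps - box_mean(maps, size)
    local_std = np.sqrt(box_mean(centered ** 2, size))
    return centered / np.maximum(local_std, eps)

def average_pool(maps, size=2):
    """Non-overlapping average pooling; trailing rows/columns that do not fit are dropped."""
    n, h, w = maps.shape
    h, w = h - h % size, w - w % size
    blocks = maps[:, :h, :w].reshape(n, h // size, size, w // size, size)
    return blocks.mean(axis=(2, 4))

def feature_stage(image, filters):
    """One stage: filter bank -> rectification -> normalization -> pooling."""
    maps = filter_bank(image, filters)
    maps = rectify(maps)
    maps = local_contrast_normalize(maps)
    return average_pool(maps)

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))    # stand-in for a grayscale image
filters = rng.standard_normal((4, 5, 5))  # random filters, no learning
features = feature_stage(image, filters)
print(features.shape)  # (4, 6, 6)
```

A second stage in this sketch would need a filter bank that combines several input feature maps; `filter_bank` as written only handles a single-channel input.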