Classification of images in many category datasets has
rapidly improved in recent years. However, systems that
perform well on particular datasets typically have one or
more limitations such as a failure to generalize across visual
tasks (e.g., requiring a face detector or extensive retuning
of parameters), insufficient translation invariance,
inability to cope with partial views and occlusion, or significant
performance degradation as the number of classes
is increased.
Here we attempt to overcome these challenges using a
model that combines sequential visual attention using fixations
with sparse coding. The model’s biologically-inspired
filters are acquired using unsupervised learning applied to
natural image patches. Using only a single feature type,
our approach achieves 78.5% accuracy on Caltech-101 and
75.2% on the 102 Flowers dataset when trained on 30 instances
per class and it achieves 92.7% accuracy on the
AR Face database with 1 training instance per perso...