Many computer vision algorithms have been developed for visual (scene and object) recognition. Some systems focus on involved learning algorithms, some leverage millions of training images, and some focus on modeling relevant information (features) with the goal of effective recognition. However, none of these systems comes close to human capabilities. Studying human responses on similar problems could give insight into which of three factors, (1) the learning algorithm, (2) the amount of training data, or (3) the features, is critical to humans’ superior performance.
In this work we take a small step towards this goal by performing a series of human studies and machine experiments. We find no evidence that human pattern-matching algorithms are better than standard machine learning algorithms. Moreover, we find that humans do not leverage increased amounts of training data. Through statistical analysis on the machine experiments and supporting human studies, we find that the m...
Devi Parikh and C. Lawrence Zitnick