We present a new approach to learning image metrics. The main advantage of our method lies in a formulation that requires only a few pairwise examples. Apparently, based on the little amount of side-information, it would take a very effective learning scheme to yield a useful image metric. Our algorithm achieves this goal by addressing two key issues. First, we establish a global-local (glocal) image representation that induces two structure-meaningful vector spaces to respectively describe the global and the local image properties. Second, we develop a metric optimization framework that finds an optimal bilinear transform to best explain the given side-information. We emphasize it is the glocal image representation that makes the use of bilinear transform more powerful. Experimental results on classifications of face images and visual tracking are included to demonstrate the contributions of the proposed method.