This work proposes a novel spoken term detection technique in which the query is provided in audio form. Detection and retrieval are performed by matching the spectrograms of the spoken document and the query as visual images, using ideas from computer vision. Local descriptors are computed on a dense grid over each spectrogram, and the query term is detected using deformable template matching of the grids. Detection experiments are performed on an hour-long newscast recording, using 10 query terms of 2-3 words each. When the query term is cut from the document itself, nearly all other instances of the term in the document are detected; performance degrades when the query is recorded by the user.
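The following is a minimal sketch of this matching idea, not the authors' implementation: descriptors here are L2-normalized log-spectrogram patches scored by dot product, and "deformation" is approximated by letting each query grid cell shift within a small window; the patch size, grid stride, and deformation radius are all assumed parameters for illustration.

```python
import numpy as np
from scipy import signal

def dense_descriptors(spec, patch=8, stride=4):
    """Extract L2-normalized patches on a dense grid over a spectrogram.

    NOTE: simple normalized patches stand in for the paper's local
    descriptors; patch/stride values are illustrative assumptions.
    """
    n_freq, n_time = spec.shape
    grid = []
    for t in range(0, n_time - patch + 1, stride):
        col = []
        for f in range(0, n_freq - patch + 1, stride):
            p = spec[f:f + patch, t:t + patch].ravel()
            p = p / (np.linalg.norm(p) + 1e-8)  # normalize for robustness
            col.append(p)
        grid.append(col)
    return np.array(grid)  # shape: (grid_time, grid_freq, patch*patch)

def deformable_match(doc, qry, radius=1):
    """Slide the query grid along the document's time axis.

    Each query cell may shift by up to `radius` grid cells in time and
    frequency (the deformation), keeping its best-matching score.
    """
    gt, gf, _ = qry.shape
    scores = []
    for start in range(doc.shape[0] - gt + 1):
        total = 0.0
        for i in range(gt):
            for j in range(gf):
                best = -1.0
                for di in range(-radius, radius + 1):
                    for dj in range(-radius, radius + 1):
                        ti, tj = start + i + di, j + dj
                        if 0 <= ti < doc.shape[0] and 0 <= tj < doc.shape[1]:
                            best = max(best, float(doc[ti, tj] @ qry[i, j]))
                total += best
        scores.append(total / (gt * gf))
    return np.array(scores)  # peaks mark candidate detections

def detect(doc_audio, qry_audio, fs=16000):
    """Score every time alignment of the query against the document."""
    _, _, S_doc = signal.spectrogram(doc_audio, fs, nperseg=256)
    _, _, S_qry = signal.spectrogram(qry_audio, fs, nperseg=256)
    doc = dense_descriptors(np.log(S_doc + 1e-10))
    qry = dense_descriptors(np.log(S_qry + 1e-10))
    return deformable_match(doc, qry)
```

Thresholding the resulting score curve (or taking its top peaks) would yield the detected occurrences; the per-cell local search is what gives the template its tolerance to small time and frequency warps between the query and the document.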