We present an efficient algorithm for continuous image
recognition and feature descriptor tracking in video that
operates by reducing the search space of possible interest
points inside the scale-space image pyramid. Instead of
performing tracking in 2D images, we search and match
candidate features in local neighborhoods inside the 3D image
pyramid without computing their feature descriptors.
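The following is a minimal sketch of such a local search, not the method's actual implementation. It assumes that detector responses (for example, difference-of-Gaussians) have been stacked into one `(scale, y, x)` array at a common resolution. The function name `search_neighborhood`, the radii, and the 3x3x3 extremum test are illustrative choices. No descriptor is computed: candidates are compared using raw detector responses only, and only a small window around the feature's previous position is examined.

```python
import numpy as np


def search_neighborhood(response, prev_point, radius_xy=3, radius_s=1, threshold=0.0):
    """Hypothetical helper: find the strongest local extremum near prev_point.

    response   -- array of shape (S, H, W) holding detector responses per scale,
                  assumed resampled to a common resolution (a simplification).
    prev_point -- (s, y, x) location of the feature in the previous frame.
    Returns the (s, y, x) of the best candidate, or None if none exceeds threshold.
    """
    S, H, W = response.shape
    s0, y0, x0 = prev_point
    best, best_val = None, threshold
    # Visit only a small 3D window around the previous location instead of
    # scanning the whole pyramid; this is what shrinks the search space.
    for s in range(max(s0 - radius_s, 0), min(s0 + radius_s + 1, S)):
        for y in range(max(y0 - radius_xy, 1), min(y0 + radius_xy + 1, H - 1)):
            for x in range(max(x0 - radius_xy, 1), min(x0 + radius_xy + 1, W - 1)):
                v = response[s, y, x]
                # Keep v only if it is a maximum of its 3x3x3 neighborhood
                # (clipped at the first and last scale).
                patch = response[max(s - 1, 0):s + 2, y - 1:y + 2, x - 1:x + 2]
                if v >= patch.max() and v > best_val:
                    best, best_val = (s, y, x), v
    return best
```

For example, if a feature was at `(1, 10, 10)` and the current frame's response peaks at `(1, 11, 12)`, the search returns `(1, 11, 12)`. A peak outside the window is ignored, so the feature is reported as lost rather than matched to a distant point.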
The candidates are further validated by fitting them to a motion
model. The resulting tracked interest points are more repeatable
and more resilient to noise, and descriptor computation
becomes much more efficient, since only the regions
of the image pyramid that contain features are searched.
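The abstract does not name the motion model used to validate candidates. The sketch below therefore makes two assumptions. First, it uses a 2D affine model over matched (previous, current) positions. Second, it substitutes a simple iterative trimming loop for whatever robust estimator the authors use: the loop fits the model, drops the worst-fitting candidate, and refits. RANSAC would be a common alternative. `validate_with_motion_model`, `max_residual`, and `min_points` are hypothetical names.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine fit: dst ~= [src, 1] @ A, with A of shape (3, 2)."""
    X = np.hstack([src, np.ones((len(src), 1))])
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return A


def validate_with_motion_model(prev_pts, curr_pts, max_residual=2.0, min_points=3):
    """Hypothetical validator: return a boolean mask of candidates consistent
    with a single affine motion (assumed model) within max_residual pixels."""
    prev_pts = np.asarray(prev_pts, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)
    inliers = np.ones(len(prev_pts), dtype=bool)
    ones = np.ones((len(prev_pts), 1))
    # An affine model has 6 parameters, so at least 3 points are needed.
    while inliers.sum() >= min_points:
        A = fit_affine(prev_pts[inliers], curr_pts[inliers])
        residuals = np.linalg.norm(np.hstack([prev_pts, ones]) @ A - curr_pts, axis=1)
        # Worst-fitting point among the current inliers.
        worst = np.argmax(np.where(inliers, residuals, -np.inf))
        if residuals[worst] <= max_residual:
            return inliers
        inliers[worst] = False  # reject it and refit on the remaining candidates
    return np.zeros(len(prev_pts), dtype=bool)
```

This trimming loop is easy to follow, but it refits once per rejected point. With many outliers, a sampling-based estimator such as RANSAC is usually cheaper and more reliable.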
We demonstrate our method on real-time object recognition
and label augmentation running on a mobile device.