We present a method for live grouping of feature points into persistent 3D clusters as a single camera browses a static scene, with no additional assumptions, training or infrastructure required. The clusters produced depend both on similar appearance and on 3D proximity information derived from real-time structure from motion, and clustering proceeds via interleaved local and global processes that permit scalable real-time operation in scenes with thousands of feature points. Notably, we use a relative 3D distance between features, which makes it possible to adjust the level of detail of the clusters according to their distance from the camera, so that the nearby scene is broken into more finely detailed clusters than the far background. We demonstrate the quality of our approach with video results showing live clustering of several indoor scenes with varying viewpoints and camera motions. The clusters produced are often consistently associated with single objects in the scene,...