We present a strategy that combines color and depth images to detect people in indoor environments. Similarity of image appearance and closeness in 3D position over time yield weights on the edges of a directed graph, which we partition greedily into tracklets: sequences of chronologically ordered observations connected by high-weight edges. Each tracklet is assigned the highest score that a Histogram of Oriented Gradients (HOG) person detector yields for any observation in the tracklet. High-scoring tracklets are deemed to correspond to people. Our experiments show a significant improvement in both precision and recall when compared to the HOG detector alone.
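
The pipeline summarized above can be illustrated with a minimal sketch. It is not the implementation evaluated in this work. The edge weight (histogram intersection multiplied by a Gaussian falloff in 3D distance), the greedy rule (accept edges in decreasing weight order while each observation has at most one predecessor and one successor), and all names and thresholds (`Observation`, `edge_weight`, `sigma`, `min_weight`, `score_threshold`) are assumptions chosen for illustration. The HOG scores are assumed to be precomputed by an external detector.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    frame: int          # time index of the RGB-D frame
    position: tuple     # (x, y, z) centroid from the depth image, in meters
    appearance: tuple   # normalized color histogram of the image region
    hog_score: float    # score from a HOG person detector (precomputed)


def edge_weight(a, b, sigma=0.5):
    """Assumed weight: appearance similarity times 3D closeness."""
    similarity = sum(min(p, q) for p, q in zip(a.appearance, b.appearance))
    distance = math.dist(a.position, b.position)
    return similarity * math.exp(-((distance / sigma) ** 2))


def greedy_tracklets(observations, min_weight=0.3):
    """Partition observations into chains of high-weight, forward-in-time edges."""
    edges = []
    for i, a in enumerate(observations):
        for j, b in enumerate(observations):
            if a.frame < b.frame:  # directed edges point forward in time
                w = edge_weight(a, b)
                if w >= min_weight:
                    edges.append((w, b.frame - a.frame, i, j))
    # Heaviest edges first; ties are broken in favor of shorter time gaps.
    edges.sort(key=lambda e: (-e[0], e[1]))

    successor, predecessor = {}, {}
    for _, _, i, j in edges:
        if i not in successor and j not in predecessor:
            successor[i] = j
            predecessor[j] = i

    tracklets = []
    for start in range(len(observations)):
        if start in predecessor:
            continue  # not the head of a chain
        chain = [start]
        while chain[-1] in successor:
            chain.append(successor[chain[-1]])
        tracklets.append(chain)
    return tracklets


def detect_people(observations, score_threshold=0.5):
    """Keep tracklets whose best HOG score reaches the threshold."""
    people = []
    for chain in greedy_tracklets(observations):
        score = max(observations[k].hog_score for k in chain)
        if score >= score_threshold:
            people.append((chain, score))
    return people
```

Under these assumptions, a tracklet is accepted as a person as soon as one of its observations receives a high detector score, even if the detector scores the remaining observations poorly. Conversely, an isolated spurious detection can still pass the threshold, since it forms a tracklet of length one.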