This paper offers a novel look at using a dimensionalityreduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work [16], our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages. Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Clustering General Terms Algorithms Keywords Hamming distance, similarity, simhash, clustering