Unlike appearance-based methods, clustering feature points solely by their motion coherence is an emerging approach to detecting and tracking individuals in crowds. This paper reformulates the problem and defines a novel clustering objective function built from potential functions, as in the conditional random field approach. Its merits are: 1) it integrates motion, spatial, and temporal information; 2) its parameters are obtained automatically by supervised learning; 3) the objective function is based on feature-pair information, which enables effective learning from a small amount of training data as well as very fast online processing. Detection ROC curves are reported on several datasets, including the CAVIAR set.
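To make the feature-pair idea concrete, the following is a minimal illustrative sketch and not the formulation used in this paper. It assumes each feature track is a mapping from frame index to image position, scores every pair of tracks with a logistic function of three hypothetical cues (variance of the pairwise distance for motion coherence, mean distance for spatial proximity, and fraction of shared frames for temporal overlap), and groups tracks by linking pairs whose score exceeds a threshold. The cue definitions, the hand-picked weights, and the connected-components grouping are all assumptions made for illustration; in the paper the parameters are learned from training data.

```python
import math
from itertools import combinations


def pair_features(a, b):
    """Hypothetical motion, spatial and temporal cues for one pair of tracks.

    Each track is a dict mapping frame index -> (x, y).
    """
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return None  # too little temporal overlap to compare motion
    dists = [math.dist(a[t], b[t]) for t in shared]
    mean_dist = sum(dists) / len(dists)  # spatial cue
    # Motion cue: points on one rigid body keep a near-constant distance.
    dist_var = sum((d - mean_dist) ** 2 for d in dists) / len(dists)
    overlap = len(shared) / min(len(a), len(b))  # temporal cue
    return (dist_var, mean_dist, overlap)


def pair_score(feats, weights, bias):
    """Logistic 'same individual' score from a linear pairwise potential."""
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


def cluster(tracks, weights, bias, threshold=0.5):
    """Link pairs scoring above threshold; return connected components.

    This greedy grouping is a stand-in for optimising a clustering objective.
    """
    parent = list(range(len(tracks)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(tracks)), 2):
        feats = pair_features(tracks[i], tracks[j])
        if feats is not None and pair_score(feats, weights, bias) > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(tracks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy data: tracks 0 and 1 move right together, track 2 moves left.
tracks = [
    {t: (float(t), 0.0) for t in range(5)},
    {t: (float(t), 1.0) for t in range(5)},
    {t: (10.0 - t, 0.5) for t in range(5)},
]
# Placeholder weights; a supervised learner would fit these instead.
weights, bias = (-2.0, -0.5, 1.0), 1.0
print(cluster(tracks, weights, bias))  # [[0, 1], [2]]
```

Because every quantity is computed per pair, labelled training examples grow quadratically with the number of annotated tracks, which is one plausible reading of why a pairwise objective can be learned from little data.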