In this paper, we propose a general method for statistical performance evaluation. The method incorporates various statistical metrics and automatically selects an appropriate one according to the problem parameters. Empirically, we compare the performance of five representative statistical metrics under different conditions through simulation: expected loss, the Friedman statistic, interval-based selection, probability of win, and probably approximately correct. In our experiments, expected loss performs best for small means (e.g., 1 or 2), and probably approximately correct performs best in all other cases. We also apply the general method to compare the performance of HITS-based algorithms combined with four relevance scoring methods, VSM, Okapi, TLS, and CDR, using a set of broad-topic queries. Among the four relevance scoring methods, CDR is statistically the best when combined with a HITS-based algorithm.
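To make two of these metrics concrete, the sketch below estimates expected loss and probability of win for a pair of systems from paired per-query scores. It is a minimal illustration, not the paper's implementation: the score arrays, the bootstrap estimator, and names such as `expected_loss` and `prob_of_win` are assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_loss(a, b):
    """Mean shortfall of system A relative to system B over paired trials."""
    return float(np.mean(np.maximum(b - a, 0.0)))

def prob_of_win(a, b, n_boot=10_000):
    """Bootstrap estimate of P(mean score of A exceeds mean score of B)."""
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample query indices
    wins = a[idx].mean(axis=1) > b[idx].mean(axis=1)
    return float(wins.mean())

# Hypothetical per-query scores for two systems (for illustration only).
scores_a = rng.normal(0.62, 0.10, size=50)
scores_b = rng.normal(0.58, 0.10, size=50)

print(f"expected loss of A vs. B: {expected_loss(scores_a, scores_b):.4f}")
print(f"P(A beats B):             {prob_of_win(scores_a, scores_b):.3f}")
```

Under this reading, a small expected loss together with a probability of win near 1 would indicate that system A dominates system B; the general method would select between such metrics based on the problem parameters.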