In evolutionary computation, experimental results are commonly analyzed using an algorithmic performance metric called best-so-far. While best-so-far can be a useful metric, its use is particularly susceptible to three pitfalls: a failure to establish a baseline for comparison, a failure to perform significance testing, and an insufficient sample size. The very nature of best-so-far makes it especially vulnerable to these pitfalls; left unaddressed, they can lead to confusion at best and misleading results at worst. We detail how multiple experimental runs, random search as a baseline, and significance testing can help researchers avoid these common pitfalls. Furthermore, we demonstrate that best-so-far can be an effective algorithmic performance metric if these guidelines are followed.

Categories and Subject Descriptors: I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search

General Terms: Experimentation...
Nathaniel P. Troutman, Brent E. Eskridge, Dean F. Hougen
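To make the abstract's guidelines concrete, the following is a minimal Python sketch (not taken from the paper) of the recommended methodology: collect the best-so-far value from many independent runs, compare against a random-search baseline, and apply a significance test. The toy fitness function, the (1+1)-style hill climber standing in for an evolutionary algorithm, and all parameter values are assumptions chosen for illustration; the Mann-Whitney U test from scipy serves as one possible nonparametric significance test.

```python
# Illustrative sketch: best-so-far over multiple runs, a random-search
# baseline, and a significance test. Fitness function, algorithm, and
# parameters are assumptions for demonstration only.
import random
import statistics

from scipy.stats import mannwhitneyu  # nonparametric significance test


def fitness(x):
    """Toy one-dimensional maximization problem (assumed for illustration)."""
    return -(x - 3.0) ** 2


def random_search(evaluations, rng):
    """Baseline: sample uniformly at random, tracking best-so-far fitness."""
    best = float("-inf")
    for _ in range(evaluations):
        best = max(best, fitness(rng.uniform(-10.0, 10.0)))
    return best


def hill_climber(evaluations, rng):
    """Stand-in 'evolutionary' algorithm: a simple (1+1)-style hill climber."""
    x = rng.uniform(-10.0, 10.0)
    best = fitness(x)
    for _ in range(evaluations):
        candidate = x + rng.gauss(0.0, 0.5)
        if fitness(candidate) >= best:
            x, best = candidate, fitness(candidate)
    return best


def final_bests(algorithm, runs, evaluations, seed):
    """Collect the final best-so-far value from many independent runs."""
    return [algorithm(evaluations, random.Random(seed + i)) for i in range(runs)]


if __name__ == "__main__":
    runs, evaluations = 50, 200  # sample size: many runs, as the paper advises
    ec = final_bests(hill_climber, runs, evaluations, seed=0)
    rs = final_bests(random_search, runs, evaluations, seed=1000)
    print(f"hill climber median best-so-far:  {statistics.median(ec):.4f}")
    print(f"random search median best-so-far: {statistics.median(rs):.4f}")
    # Significance test: is the observed difference beyond chance variation?
    stat, p = mannwhitneyu(ec, rs, alternative="two-sided")
    print(f"Mann-Whitney U = {stat:.1f}, p = {p:.4g}")
```

Reporting the baseline's distribution alongside the algorithm's, rather than a single best-so-far curve from one run, is what guards against the three pitfalls the abstract identifies.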