Audio-based music similarity measures can be used to automatically generate playlists or recommendations. In this paper, spectral similarity is combined with complementary information from fluctuation patterns, including two new descriptors derived from them. Performance is evaluated in a series of experiments on four music collections. The evaluations are based on genre classification, under the assumption that very similar tracks belong to the same genre. The main findings are as follows: (1) Although the improvements are substantial on two of the four collections, our extensive experiments confirm earlier findings that we are approaching the limit of what can be achieved using simple audio statistics. (2) Evaluating similarity through genre classification is biased by the music collection (and genre taxonomy) used. (3) In a cross-validation, no pieces from the same artist should appear in both the training and the test set.
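The artist constraint in the cross-validation can be illustrated with a minimal sketch. It assumes a nearest-neighbour classifier with leave-one-out evaluation over a precomputed distance matrix; neither choice is stated above, and the function name `nn_genre_accuracy`, its parameters, and the toy data are hypothetical and used only for illustration.

```python
def nn_genre_accuracy(dist, genres, artists, artist_filter=True):
    """Leave-one-out 1-NN genre classification accuracy.

    dist    : square matrix (list of lists) of pairwise track distances
    genres  : genre label per track
    artists : artist label per track
    artist_filter : if True, tracks by the query's artist are excluded
                    from the candidate neighbours (hypothetical flag).
    """
    n = len(genres)
    correct = 0
    evaluated = 0
    for i in range(n):
        # Candidate neighbours: every other track, optionally minus
        # all tracks that share the query track's artist.
        candidates = [
            j for j in range(n)
            if j != i and not (artist_filter and artists[j] == artists[i])
        ]
        if not candidates:
            continue  # no admissible neighbour for this track
        nearest = min(candidates, key=lambda j: dist[i][j])
        evaluated += 1
        correct += genres[nearest] == genres[i]
    return correct / evaluated if evaluated else float("nan")


# Made-up toy data: tracks 0 and 1 share an artist and are very close.
toy_dist = [
    [0.0, 0.1, 0.9, 0.5],
    [0.1, 0.0, 0.8, 0.4],
    [0.9, 0.8, 0.0, 0.3],
    [0.5, 0.4, 0.3, 0.0],
]
toy_genres = ["rock", "rock", "rock", "jazz"]
toy_artists = ["A", "A", "B", "C"]

acc_unfiltered = nn_genre_accuracy(toy_dist, toy_genres, toy_artists, artist_filter=False)
acc_filtered = nn_genre_accuracy(toy_dist, toy_genres, toy_artists, artist_filter=True)
print(acc_unfiltered, acc_filtered)  # 0.5 0.0
```

On this toy data the unfiltered accuracy comes entirely from the two tracks by the same artist matching each other, and it drops to zero once those matches are excluded. The numbers themselves are artificial; the sketch only shows the mechanism by which same-artist pieces in both training and test sets can inflate a genre-based evaluation.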