Systematically testing the many parameters and algorithmic variants involved in the design of high-level music descriptors in general, and of similarity measures in particular, is a daunting task that requires building a general architecture nearly as complex as a full-fledged Music Browsing system. In this paper, we report on experiments conducted in an attempt to improve the performance of the music similarity measure described in [2], using the Cuidado Music Browser [8]. We do not primarily report on the actual results of the evaluation, but rather on the methodology and the various tools that were built to support such a task. We show that many non-technical browsing features are useful at various stages of the evaluation process and, in turn, that some of the tools developed for the expert user can be reinjected into the Music Browser to benefit the non-technical user.