One prevalent method for evaluating the results of automated software analysis tools is to compare the tools’ output against the judgment of human experts. This evaluation strategy is commonly assumed in the field of software clone detector research. We report our experiences from a study in which several human judges attempted to establish “reference sets” of function clones for several medium-sized software systems written in C. The study followed a process typical of inter-coder reliability assurance, in which coders discuss classification discrepancies until consensus is reached. A high level of disagreement was found for reference sets constructed specifically for reengineering task contexts. The results, although preliminary, raise questions about the limitations of prior clone detector evaluations and of similar tool evaluations more generally. Implications are drawn for future work on reference data generation, tool evaluation, and benchmarking efforts.
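As a minimal illustration of how inter-coder agreement is typically quantified, the sketch below computes Cohen’s kappa for two judges over a set of candidate clone pairs. The metric and the judgment labels are assumptions for illustration only; the study itself does not specify this measure or these data.

```python
# Sketch: quantifying inter-coder agreement with Cohen's kappa.
# The judgment labels below are hypothetical placeholders, not data from the study.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same candidate clone pairs."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed proportion of pairs on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts ("clone" / "not") from two judges on six candidate pairs.
judge_1 = ["clone", "clone", "not", "clone", "not", "not"]
judge_2 = ["clone", "not",   "not", "clone", "clone", "not"]
print(cohens_kappa(judge_1, judge_2))  # kappa well below 1 signals disagreement
```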