We report the results of an experiment assessing whether automated MT evaluation metrics remain sensitive to variations in MT quality as the average quality of the compared systems increases. We compare two groups of metrics: those that measure the proximity of MT output to a reference translation, and those that evaluate the performance of an automated process on degraded MT output. The experiment shows that proximity-based metrics (such as BLEU) lose sensitivity as scores rise, whereas performance-based metrics (e.g., Named Entity recognition from MT output) remain sensitive across the scale. We suggest a model to explain this result: performance-based metrics retain their sensitivity because they measure the cumulative functional effect of errors at different language levels, whereas proximity-based metrics measure structural matches at the lexical level only and therefore miss the higher-level errors that are more typical of better MT systems. Development of new auto...