We describe a focused effort to investigate the performance of phrase-based, human evaluation of machine translation output achieving a high annotator agreement. We define phrase-based evaluation and describe the implementation of Appraise, a toolkit that supports the manual evaluation of machine translation results. Phrase ranking can be done using either a fine-grained six-way scoring scheme that allows to differentiate between "much better" and "slightly better", or a reduced subset of ranking choices. Afterwards we discuss values for both scoring models from several experiments conducted with human annotators. Our results show that phrase-based evaluation can be used for fast evaluation obtaining significant agreement among annotators. The granularity of ranking choices should, however, not be too fine-grained as this seems to confuse annotators and thus reduces the overall agreement. The work reported in this paper confirms previous work in the field and illus...