In document analysis, it is common to prove the usefulness of a component by an experimental evaluation. The respective algorithms are applied to a test sample, and effectiveness measures such as recall, precision, and accuracy are computed. The goal of such an evaluation is twofold: on the one hand, it shows that the absolute effectiveness of the algorithm is acceptable for practical use; on the other hand, it can prove that the algorithm is more or less effective than another algorithm. In this paper we argue that the experimental evaluation on relatively small test sets