When intelligent systems reason about complex problems with a large hierarchical classification space, it is hard to evaluate system performance. Different evaluation criteria exist for classification problems, but they either focus on a belief expressed over all possible, mutually exclusive labels (soft classification), or they are based on the set of labels returned by a classifier for hierarchical labels (hard classification). Measures for evaluating a classifier that assigns belief to all labels when these labels are hierarchically related, however, are lacking. This paper puts forward two new criteria for evaluating soft output for hierarchical labels, using a generic and flexible model of the solution space. The first criterion gives information on the accuracy of the system and the second on its robustness. Results obtained with these new criteria are compared to those of existing criteria on a hierarchical classification task with different classifiers.
Wilbert van Norden, Catholijn M. Jonker