We consider the problem of constructing decision trees for entity identification from a given table. The input is a table containing information about a set of entities over a fixed set of attributes. The goal is to construct a decision tree that identifies each entity unambiguously by testing the attribute values such that the average number of tests is minimized. The previously best known approximation ratio for this problem was O(log^2 N). In this paper, we present a new greedy heuristic that yields an improved approximation ratio of O(log N).
Venkatesan T. Chakaravarthy, Vinayaka Pandit, Samb
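
To make the objective in the abstract concrete, the following minimal Python sketch builds a decision tree over a small table and computes the average number of tests needed to identify an entity. It is an illustration under stated assumptions, not the greedy heuristic of the paper: the splitting rule used here (pick the attribute whose largest branch is smallest) is a simple stand-in, and the names `build_tree`, `total_tests`, and `average_tests` are hypothetical. The table is assumed to be a list of rows, one per entity, with attributes addressed by column index.

```python
from collections import defaultdict


def build_tree(table, entities, attributes):
    """Build a tree separating `entities` (row indices of `table`).

    A leaf is a row index; an internal node is (attribute, {value: subtree}).
    ASSUMPTION: the splitting rule below is a simple balance heuristic chosen
    for illustration; it is not the heuristic analysed in the paper.
    """
    if len(entities) == 1:
        return entities[0]  # a single entity remains: it is identified

    best_attr, best_parts = None, None
    for a in attributes:
        # Group the remaining entities by their value on attribute `a`.
        parts = defaultdict(list)
        for e in entities:
            parts[table[e][a]].append(e)
        if len(parts) == 1:
            continue  # testing `a` would not distinguish anything here
        largest = max(len(p) for p in parts.values())
        if best_parts is None or largest < max(len(p) for p in best_parts.values()):
            best_attr, best_parts = a, parts

    if best_parts is None:
        raise ValueError("some entities agree on every attribute")

    children = {v: build_tree(table, es, attributes) for v, es in best_parts.items()}
    return (best_attr, children)


def total_tests(tree, depth=0):
    """Sum, over all leaves, of the number of tests on the root-to-leaf path."""
    if not isinstance(tree, tuple):
        return depth
    _, children = tree
    return sum(total_tests(child, depth + 1) for child in children.values())


def average_tests(table):
    """Average number of tests per entity for the tree built by `build_tree`."""
    attributes = range(len(table[0]))
    tree = build_tree(table, list(range(len(table))), attributes)
    return total_tests(tree) / len(table)
```

Calling `average_tests` on a table whose rows are pairwise distinct returns the cost that the problem asks to minimize, here for one particular tree. Because the splitting rule is only a stand-in, the resulting tree need not be optimal or carry any approximation guarantee.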