Real-world datasets exhibit a complex dependency structure among the data attributes. Learning this structure is a key task in automatic statistics configuration for query optimizers, as well as in data mining, metadata discovery, and system management. In this paper, we provide a new method for discovering dependent attribute pairs based on query feedback. Our approach avoids the problem of searching through a combinatorially large space of candidate attribute pairs, automatically focusing system resources on those pairs of demonstrable interest to users. Unlike previous methods, our technique combines all of the pertinent feedback for a specified pair of attributes in a principled and robust manner, while being simple and fast enough to be incorporated into current commercial products. The method is similar in spirit to the CORDS algorithm, which proactively collects frequencies of data values and computes a chi-squared statistic from the resulting contingency table. In the reacti...
Peter J. Haas, Fabian Hueske, Volker Markl