We present a novel, fast method for association mining in high-dimensional datasets. Our Coincidence Detection method, which combines random sampling and Chernoff-Hoeffding bounds with a novel coding/binning scheme, avoids the exhaustive search, prior limits on the order k of discovered associations, and exponentially large parameter space of other methods. Tight theoretical bounds on the complexity of randomized algorithms are impossible without strong input distribution assumptions. However, we observe sublinear time, space and data complexity in tests on constructed artificial datasets and in real application to important problems in bioinformatics and drug discovery. After placing the method in historical and mathematical context, we describe the method, and present theoretical and empirical results on its complexity and error.

Getting information from a table is like extracting sunlight from a cucumber. (H. Farquhar, "Economic and Industrial Delusions", 1891)
Evan W. Steeg, Derek A. Robinson, Ed Willis