Most enterprise data is distributed across multiple relational databases with expert-designed schemas. Applying traditional single-table machine learning techniques to such data not only incurs a computational penalty for converting it to a "flat" form (a mega-join), but also loses the human-specified semantic information present in the relations. In this paper, we present a two-phase hierarchical meta-classification algorithm for relational databases that takes a semantic divide-and-conquer approach. We propose a recursive prediction aggregation technique over heterogeneous classifiers applied to individual database tables. A preliminary evaluation on TPC-H and UCI benchmarks shows reduced training time without any loss of prediction accuracy.
Geetha Manjunath, M. Narasimha Murty, Dinkar Sitar
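
To make the idea of per-table classification followed by prediction aggregation concrete, the following is a minimal sketch and not the algorithm evaluated in the paper, whose details the abstract does not give. It assumes a hypothetical two-table schema (`customers` as the parent table, `orders` as a child table linked by a foreign key), a toy threshold learner (`Stump`) standing in for any per-table classifier, and a simple mean vote as the aggregation step. All names and data are illustrative assumptions, and only one level of the hierarchy is shown rather than the full recursion.

```python
from collections import defaultdict
from statistics import mean


class Stump:
    """One-feature threshold classifier; a stand-in for any per-table learner."""

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        # Threshold halfway between the two class means.
        self.threshold = (mean(pos) + mean(neg)) / 2
        self.positive_above = mean(pos) >= mean(neg)
        return self

    def predict(self, x):
        return int((x > self.threshold) == self.positive_above)


# Hypothetical parent table: customer_id -> (income, class label).
customers = {1: (20, 0), 2: (25, 0), 3: (80, 1), 4: (90, 1)}
# Hypothetical child table: (customer_id foreign key, order amount).
orders = [(1, 5), (1, 7), (2, 6), (3, 50), (3, 60), (4, 55), (4, 8)]

# Phase 1: train one classifier per table, with no join into a flat table.
ids = sorted(customers)
parent_clf = Stump().fit([customers[i][0] for i in ids],
                         [customers[i][1] for i in ids])
# Child rows inherit the label of the parent row they reference.
child_clf = Stump().fit([amount for _, amount in orders],
                        [customers[cid][1] for cid, _ in orders])


def meta_feature(income, amounts):
    """Combine the parent prediction with the aggregated child predictions."""
    parent_pred = parent_clf.predict(income)
    # Aggregation: fraction of child rows voting for the positive class.
    # A parent with no child rows gets a neutral vote of 0.5 (an assumption).
    vote = mean(child_clf.predict(a) for a in amounts) if amounts else 0.5
    return parent_pred + vote


# Phase 2: train a meta-classifier on the aggregated per-table predictions.
amounts_by_customer = defaultdict(list)
for cid, amount in orders:
    amounts_by_customer[cid].append(amount)

meta_clf = Stump().fit(
    [meta_feature(customers[i][0], amounts_by_customer[i]) for i in ids],
    [customers[i][1] for i in ids],
)


def predict(income, amounts):
    """Classify a parent row from its own attributes and its child rows."""
    return meta_clf.predict(meta_feature(income, amounts))


predictions = {i: predict(customers[i][0], amounts_by_customer[i]) for i in ids}
```

In this sketch each table is learned separately and only the predictions cross table boundaries, which is the property the abstract contrasts with the mega-join. Extending it to a deeper schema would mean applying `meta_feature` recursively, so that each child table first aggregates the predictions of its own children.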