We present an ensemble learning approach that achieves accurate predictions from arbitrarily partitioned data. The partitions come from the distributed processing requirements of a large-scale simulation, where the volume of data is such that classifiers can train only on data local to a given partition. Because the partitioning reflects the need for efficient simulation analysis rather than the needs of data mining, the class statistics vary across partitions; indeed, some classes will likely be absent from some partitions. We combine a fast ensemble learning algorithm with majority voting to generate an accurate working model of the simulation. Results from several simulations show that regions of interest are successfully identified in spite of training set class imbalances. Accuracy is analyzed both at the level of nodes in the simulation data structure and in terms of higher-level regions of interest. It is shown that over 98% of salient regions are found in indepen...
Larry Shoemaker, Robert E. Banfield, Lawrence O. H
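
The combination step described in the abstract (one classifier per partition, trained only on local data, with predictions merged by majority vote) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the nearest-centroid model is a toy stand-in for the fast ensemble learner, and the function names, the two-dimensional features, and the "salient"/"background" labels are hypothetical rather than taken from the paper.

```python
from collections import Counter


def train_centroids(samples):
    """Fit a toy nearest-centroid model on one partition's local (features, label) pairs."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    # A model only knows the classes that were present in its own partition.
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def majority_vote(models, features):
    """Combine the per-partition predictions by unweighted majority vote."""
    votes = Counter(predict(model, features) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical data: three partitions, the third of which contains no "salient" examples.
partitions = [
    [((0.0, 0.0), "background"), ((0.2, 0.1), "background"), ((1.0, 1.0), "salient")],
    [((0.1, 0.0), "background"), ((0.9, 1.1), "salient"), ((1.1, 0.9), "salient")],
    [((0.0, 0.2), "background"), ((0.1, 0.1), "background")],
]

# Each model is trained on its own partition only; no data is shared.
models = [train_centroids(partition) for partition in partitions]

label_salient = majority_vote(models, (0.95, 1.0))
label_background = majority_vote(models, (0.05, 0.05))
```

In this toy setup the third model can only ever vote "background", yet the vote over all three models still labels the point near (1, 1) as "salient", which is the kind of robustness to missing classes that the abstract attributes to voting.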