Data mining systems aim to discover patterns and extract useful information from facts recorded in databases. A widely adopted approach is to apply machine learning algorithms to compute descriptive models or classifiers from the available data. Two of the main challenges in this area are that i) databases are large and possibly physically distributed, and ii) data are cost-sensitive: examples in the databases usually carry different costs or benefits (such as the amount of a charity donation), so an effective model must be more accurate on examples with higher benefits. Here, we explore the development of techniques that address both issues in order to scale up cost-sensitive data mining. One naive approach to distributed data mining is a centralized system that ships all available data from the different sites to a single site to learn a global model. Besides its obvious communication overhead, this approach is ineffective due to many practical concerns. The second approach is a part...
Wei Fan, Haixun Wang, Philip S. Yu, Salvatore J. S