Discovering complex associations, anomalies and patterns in distributed data sets is gaining popularity in a range of scientific, medical and business applications. Various algorithms are employed to perform data analysis within a domain, ranging from statistical to machine learning and AI-based techniques. Several issues must be addressed, however, to scale such approaches to large data sets, particularly when they are applied to data distributed across various sites. As new analysis techniques are identified, the core tool set must enable easy integration of such analytical components. Similarly, results from an analysis engine must be sharable, to enable storage, visualisation or further analysis of results. We describe the architecture of PaDDMAS, a component-based system for developing distributed data mining applications. PaDDMAS provides a tool set for combining pre-developed or custom components using a dataflow approach, with components performing analysis, data extractio...
Omer F. Rana, David W. Walker, Maozhen Li, Steven