Grid environments enable users to share non-dedicated resources that lack performance guarantees. This paper describes the design of application-centric middleware components to automatically recover from failures and dynamically adapt to grid environments with changing resource availabilities, improving fault-tolerance and performance. The key components of the application-centric approach are a global per-application execution history and an autonomic component that tracks the performance of a job on a grid resource against predictions based on the application execution history, to guide rescheduling decisions. Performance models of unmodified applications built using their execution history are used to predict failure as well as poor performance. A prototype of the proposed approach, an Autonomic Virtual Application Manager (AVAM), has been implemented in the context of the In-VIGO grid environment and its effectiveness has been evaluated for applications that generate CPU-intensiv...
Jing Xu, Sumalatha Adabala, José A. B. Fort