ct Fault Tolerant MPI (FT-MPI)[6] was designed as a solution to allow applications different methods to handle process failures beyond simple check-point restart schemes. The init...
Graham E. Fagg, Thara Angskun, George Bosilca, Jel...
An increasing number of applications are being developed using distributed object computing (DOC) middleware, such as CORBA. Many of these applications require the underlying midd...
Aniruddha S. Gokhale, Balachandran Natarajan, Doug...
An increasing number of mission-critical, embedded, telecommunications, and financial distributed systems are being developed using distributed object computing middleware, such a...
Balachandran Natarajan, Aniruddha S. Gokhale, Shal...
As the number of processors in today’s high performance computers continues to grow, the mean-time-to-failure of these computers are becoming significantly shorter than the exe...
Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julie...
Fault tolerance is a major concern to guarantee availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/r...