The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper introduces VolpexMPI that is designed to enable ...
Troy LeBlanc, Rakhi Anand, Edgar Gabriel, Jaspal S...
Recently there has been renewed interest in building reliable servers that support continuous application operation. Besides maintaining system state consistent after a failure, o...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
As the size and popularity of computer clusters go on growing, fault tolerance is becoming a crucial factor to ensure high performance and reliability for applications. To provide...
Antonio S. Martins, Ronaldo Augusto Lara Gon&ccedi...
Fault-tolerance techniques based on checkpointing and message logging have been increasingly used in real-world applications to reduce service down-time. Most industrial applicati...