Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection a...
Pierre Lemarinier, Aurelien Bouteiller, Thomas H&e...
With the increasing number of processors in modern HPC(High Performance Computing) systems, there are two emergent problems to solve. One is scalability, the other is fault tolera...
Defining the ways for components around the world to collaborate with each other to execute applications over the internet is one of the biggest challenges for computer scientists...
Abstract. With the number of computing elements spiraling to hundred of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas H&eacu...
: Group communication in real-time computing systems has been a subject of research for almost two decades but it is not yet a mature technological field. The purpose of this paper...