Group communication in real-time computing systems has been a subject of research for almost two decades but it is not yet a mature technological field. The purpose of this paper...
The productivity of HPC systems is determined not only by their performance but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
As computer clusters continue to grow in size and popularity, fault tolerance is becoming a crucial factor in ensuring high performance and reliability for applications. To provide...
Antonio S. Martins, Ronaldo Augusto Lara Gonç...
In this paper, we classify multitolerant systems, i.e., systems that tolerate multiple classes of faults and provide potentially different levels of tolerance to them in terms o...
In this paper, we propose a task scheduling algorithm for a multicore processor system that reduces the recovery time in case of a single fail-stop failure of a multicore processo...