The productivity of HPC system is determined not only by their performance, but also by their reliability. The conventional method to limit the impact of failures is checkpointing...
Existing Byzantine-resilient replication protocols satisfy two standard correctness criteria, safety and liveness, in the presence of Byzantine faults. In practice, however, fault...
Yair Amir, Brian A. Coan, Jonathan Kirsch, John La...
Networks of Workstations (NOW) have become important and cost-effective parallel platforms for scientific computations. In practice, a NOW system is heterogeneous and non-dedicat...
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of power hungry components will lead to intolerable operating costs and failure rates. R...
Abstract--With the development of high-performance computing, I/O issues have become the bottleneck for many massively parallel applications. This paper investigates scalable paral...
Jing Fu, Ning Liu, Onkar Sahni, Kenneth E. Jansen,...