Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover softw...
Stelios Sidiroglou, Oren Laadan, Carlos Perez, Nic...
The paper presents a way to robustly command a system of systems as a single entity. Instead of modeling each component system in isolation and then manually crafting interaction p...
In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...
Due to modern technology trends such as decreasing feature sizes and lower voltage levels, fault tolerance is becoming increasingly important in computing systems. Shared memory i...
Over the past decade the number of processors in the high performance facilities went up to hundreds of thousands. As a direct consequence, while the computational power follow th...
Aurelien Bouteiller, George Bosilca, Jack Dongarra