Fault and adversary tolerance have become not only desirable but required properties of software systems because mission-critical systems are commonly distributed on large network...
Abstract. With the number of computing elements spiraling to hundreds of thousands in modern HPC systems, failures are common events. Few applications are nevertheless fault toleran...
George Bosilca, Aurelien Bouteiller, Thomas Hé...
Classic fault localization techniques can automatically provide information about the suspicious code blocks that are likely responsible for observed failures. This information is...
We present an online framework to capture and recover from program failures and prevent them from occurring in the future through safe execution perturbations. The perturbations a...
Developers write and execute ad-hoc tests as they implement software. While these tests reflect important insights of the developers (e.g., which parts of the software need testi...
Andreas Leitner, Alexander Pretschner, Stefan Mori...