Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...
To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...
Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...
As the size and popularity of computer clusters go on growing, fault tolerance is becoming a crucial factor to ensure high performance and reliability for applications. To provide...
Antonio S. Martins, Ronaldo Augusto Lara Gon&ccedi...
Abstract — Distributed sensor system applications (e.g., wireless sensor networks) have been studied extensively in recent years. Such applications involve resource-limited embed...
Chung-Ching Shen, Roni Kupershtok, Shuvra S. Bhatt...
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a ho...
We study recent developments in quantum computing (QC) testing and fault tolerance (FT) techniques and discuss several attempts to formalize quantum logic fault models. We illustr...
David Y. Feinstein, V. S. S. Nair, Mitchell A. Tho...
The emergence of computational grids has lead to an increased reliance on task schedulers that can guarantee the completion of tasks that are executed on unreliable systems. There...
Daniel C. Vanderster, Nikitas J. Dimopoulos, Randa...
Programmable logic arrays (PLA), which can implement arbitrary logic functions in a two-level logic form, are promising as platforms for nanoelectronic logic due to their highly r...
Applications that span multiple virtual organizations (VOs) are of great interest to the eScience community. However, recent attempts to execute large-scale parameter sweep applic...
Shahaan Ayyub, David Abramson, Colin Enticott, Sla...
Real-time applications typically have to satisfy high dependability requirements and require fault tolerance in both value and time domains. A widely used approach to ensure fault...