Sciweavers

464 search results - page 57 / 93
» A Fault Tolerance Protocol with Fast Fault Recovery
ASPLOS
2009
ACM
14 years 8 months ago
ASSURE: automatic software self-healing using rescue points
Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover softw...
Stelios Sidiroglou, Oren Laadan, Carlos Perez, Nic...
PROMAS
2005
Springer
14 years 1 month ago
A Model-Based Executive for Commanding Robot Teams
The paper presents a way to robustly command a system of systems as a single entity. Instead of modeling each component system in isolation and then manually crafting interaction p...
Anthony Barrett
HPDC
2011
IEEE
12 years 11 months ago
Algorithm-based recovery for iterative methods without checkpointing
In today’s high performance computing practice, fail-stop failures are often tolerated by checkpointing. While checkpointing is a very general technique and can often be applied...
Zizhong Chen
DSD
2008
IEEE
13 years 9 months ago
A Low-Cost Cache Coherence Verification Method for Snooping Systems
Due to modern technology trends such as decreasing feature sizes and lower voltage levels, fault tolerance is becoming increasingly important in computing systems. Shared memory i...
Demid Borodin, Ben H. H. Juurlink
CONCURRENCY
2010
13 years 7 months ago
Redesigning the message logging model for high performance
Over the past decade, the number of processors in high performance facilities has grown to hundreds of thousands. As a direct consequence, while the computational power follows th...
Aurelien Bouteiller, George Bosilca, Jack Dongarra