Sciweavers

185 search results - page 30 / 37
» Software monitoring with bounded overhead
Sort
View
SC
2000
ACM
14 years 1 months ago
Scalable Fault-Tolerant Distributed Shared Memory
This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...
Florin Sultan, Thu D. Nguyen, Liviu Iftode
PODC
2006
ACM
14 years 2 months ago
Grouped distributed queues: distributed queue, proportional share multiprocessor scheduling
We present Grouped Distributed Queues (GDQ), the first proportional share scheduler for multiprocessor systems that scales well with a large number of processors and processes. G...
Bogdan Caprita, Jason Nieh, Clifford Stein
IPPS
2005
IEEE
14 years 2 months ago
Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
Adam J. Oliner, Ramendra K. Sahoo, José E. ...
ENTCS
2007
113views more  ENTCS 2007»
13 years 8 months ago
Modular Checkpointing for Atomicity
Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execut...
Lukasz Ziarek, Philip Schatz, Suresh Jagannathan
JFP
2010
107views more  JFP 2010»
13 years 7 months ago
Lightweight checkpointing for concurrent ML
Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execut...
Lukasz Ziarek, Suresh Jagannathan