Search Sciweavers | Sciweavers

185 search results - page 30 / 37

» Software monitoring with bounded overhead

SC
2000
ACM

110 views, Applied Computing

Scalable Fault-Tolerant Distributed Shared Memory

15 years 11 months ago

Download www.sc2000.org

This paper shows how a state-of-the-art software distributed shared-memory (DSM) protocol can be efficiently extended to tolerate single-node failures. In particular, we extend a ...

Florin Sultan, Thu D. Nguyen, Liviu Iftode

PODC
2006
ACM

342 views, Distributed And Parallel Com...

Grouped distributed queues: distributed queue, proportional share multiprocessor scheduling

16 years 27 days ago

Download www.ncl.cs.columbia.edu

We present Grouped Distributed Queues (GDQ), the first proportional share scheduler for multiprocessor systems that scales well with a large number of processors and processes. G...

Bogdan Caprita, Jason Nieh, Clifford Stein

IPPS
2005
IEEE

132 views, Distributed And Parallel Com...

Performance Implications of Periodic Checkpointing on Large-Scale Cluster Systems

16 years 15 days ago

Download adam.oliner.net

Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...

Adam J. Oliner, Ramendra K. Sahoo, José E. ...

ENTCS
2007

113 views

Modular Checkpointing for Atomicity

15 years 6 months ago

Download www.cs.purdue.edu

Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execut...

Lukasz Ziarek, Philip Schatz, Suresh Jagannathan

JFP
2010

107 views

Lightweight checkpointing for concurrent ML

15 years 5 months ago

Download www.cs.purdue.edu

Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execut...

Lukasz Ziarek, Suresh Jagannathan
