Search Sciweavers | Sciweavers

1256 search results - page 18 / 252

» On Coordinated Checkpointing in Distributed Systems

SRDS
2003
IEEE

102 views, Operating System

Raptor: Integrating Checkpoints and Thread Migration for Cluster Management

14 years 2 months ago

Download ecadw.colorado.edu

Software distributed shared-memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. Howeve...

Hazim Shafi, Evan Speight, John K. Bennett


GI
2004
Springer

113 views, Theoretical Computer Science

Crash Management for Distributed Parallel Systems

14 years 2 months ago

Download www.ti.informatik.uni-frankfurt.de

With the growing complexity of parallel architectures, the probability of system failures grows, too. One approach to cope with this problem is the self-healing, one of the organ...

Jan Haase, Frank Eschmann


PVM
2005
Springer

117 views, Distributed And Parallel Com...

Cooperative Write-Behind Data Buffering for MPI I/O

14 years 2 months ago

Download cucis.ece.northwestern.edu

Many large-scale production parallel programs often run for a very long time and require data checkpoint periodically to save the state of the computation for program restart and/o...

Wei-keng Liao, Kenin Coloma, Alok N. Choudhary, Le...


SOSP
2005
ACM

172 views, Operating System

Speculative execution in a distributed file system

14 years 5 months ago

Download www.eecs.umich.edu

Speculator provides Linux kernel support for speculative execution. It allows multiple processes to share speculative state by tracking causal dependencies propagated through inte...

Edmund B. Nightingale, Peter M. Chen, Jason Flinn


IPPS
2007
IEEE

129 views, Distributed And Parallel Com...

A Fault Tolerance Protocol with Fast Fault Recovery

14 years 3 months ago

Download www.cecs.uci.edu

Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...

Sayantan Chakravorty, Laxmikant V. Kalé
