Sciweavers

» On Coordinated Checkpointing in Distributed Systems
MIDDLEWARE
2007
Springer
14 years 2 months ago
Using checkpointing to recover from poor multi-site parallel job scheduling decisions
Recent research in multi-site parallel job scheduling leverages user-provided estimates of job communication characteristics to effectively partition the job across multiple clus...
William M. Jones
CCGRID
2007
IEEE
14 years 3 months ago
Reparallelization and Migration of OpenMP Programs
Typical computational grid users target only a single cluster and have to estimate the runtime of their jobs. Job schedulers prefer short-running jobs to maintain a high system ut...
Michael Klemm, Matthias Bezold, Stefan Gabriel, Ro...
JVM
2004
13 years 10 months ago
One-Click Distribution of Preconfigured Linux Runtime State
Checkpointing virtual machines shows potential for allowing a user to download, install, and initialize a complete software environment by selecting a web page link. Starting with...
Richard Potter
IPPS
2007
IEEE
14 years 3 months ago
The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...
Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...
IEEEHPCS
2010
13 years 6 months ago
Using replication and checkpointing for reliable task management in computational Grids
In grid computing systems, providing fault-tolerance is required for both scientific computation and file-sharing to increase their reliability. In previous works, several mechani...
Sangho Yi, Derrick Kondo, Bongjae Kim, Geunyoung P...