Sciweavers

116 search results - page 5 / 24
» A Communication Framework for Fault-Tolerant Parallel Execut...
CORR
2010
Springer
13 years 7 months ago
Efficient System-Enforced Deterministic Parallelism
Deterministic execution offers many benefits for debugging, fault tolerance, and security. Current methods of executing parallel programs deterministically, however, often incur h...
Amittai Aviram, Shu-Chun Weng, Sen Hu, Bryan Ford
PVM
2009
Springer
14 years 2 months ago
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes
The objective of this research is to convert ordinary idle PCs into virtual clusters for executing parallel applications. The paper introduces VolpexMPI that is designed to enable ...
Troy LeBlanc, Rakhi Anand, Edgar Gabriel, Jaspal S...
HPCC
2010
Springer
13 years 7 months ago
A Generic Execution Management Framework for Scientific Applications
Managing the execution of scientific applications in a heterogeneous grid computing environment can be a daunting task, particularly for long running jobs. Increasing fault tolera...
Tanvire Elahi, Cameron Kiddle, Rob Simmonds
CASES
2009
ACM
14 years 2 months ago
Towards scalable reliability frameworks for error prone CMPs
As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at vario...
Joseph Sloan, Rakesh Kumar
ICDCS
2012
IEEE
11 years 9 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...