Sciweavers

668 search results - page 5 / 134
» Implementing and Evaluating Automatic Checkpointing
IPPS
2007
IEEE
14 years 1 month ago
A Fault Tolerance Protocol with Fast Fault Recovery
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, rollback all ...
Sayantan Chakravorty, Laxmikant V. Kalé
HPCA
2005
IEEE
14 years 7 months ago
Checkpointed Early Load Retirement
Long-latency loads are critical in today's processors due to the ever-increasing speed gap with memory. Not only do these loads block the execution of dependent instructions,...
Nevin Kirman, Meyrem Kirman, Mainak Chaudhuri, Jos...
IPPS
2005
IEEE
14 years 9 days ago
Impact of Event Logger on Causal Message Logging Protocols for Fault Tolerant MPI
Fault tolerance in MPI becomes a main issue in the HPC community. Several approaches are envisioned from user or programmer controlled fault tolerance to fully automatic fault ...
Aurelien Bouteiller, Boris Collin, Thomas Hé...
ICDE
2011
IEEE
12 years 10 months ago
RAFTing MapReduce: Fast recovery on the RAFT
MapReduce is a computing paradigm that has gained a lot of popularity as it allows non-expert users to easily run complex analytical tasks at very large-scale. At such scale, task...
Jorge-Arnulfo Quiané-Ruiz, Christoph Pinkel...
CLUSTER
2003
IEEE
14 years 2 hours ago
HPCM: A Pre-Compiler Aided Middleware for the Mobility of Legacy Code
Mobility is a fundamental functionality of the next generation internet computing. How to support mobility for legacy codes, however, is still an issue of research. The key to sol...
Cong Du, Xian-He Sun, Kasidit Chanchio