Sciweavers

86 search results - page 6 / 18
» Hybrid checkpointing for parallel applications in cluster fe...
Sort
View
PDCAT
2009
Springer
14 years 2 months ago
CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
Abstract—In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not supp...
Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu,...
HPDC
2000
IEEE
13 years 12 months ago
RAID-x: A New Distributed Disk Array for I/O-Centric Cluster Computing
A new RAID-x (redundant array of inexpensive disks at level x) architecture is presented for distributed I/O processing on a serverless cluster of computers. The RAID-x architectu...
Kai Hwang, Hai Jin, Roy S. C. Ho
CLUSTER
2004
IEEE
13 years 11 months ago
Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
Fault tolerance is a very important concern for critical high performance applications using the MPI library. Several protocols provide automatic and transparent fault detection a...
Pierre Lemarinier, Aurelien Bouteiller, Thomas H&e...
ICDCS
2012
IEEE
11 years 10 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (1015 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...
IPPS
2006
IEEE
14 years 1 months ago
Coordinated checkpoint from message payload in pessimistic sender-based message logging
Execution of MPI applications on Clusters and Grid deployments suffers from node and network failure that motivates the use of fault tolerant MPI implementations. Two category tec...
M. Aminian, Mohammad K. Akbari, Bahman Javadi