Sciweavers

204 search results - page 10 / 41
» Fault-tolerant solutions for a MPI compute intensive applica...
HIPC
2009
Springer
13 years 5 months ago
Fast checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on multicore architecture
Large scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has redu...
Xiangyong Ouyang, Karthik Gopalakrishnan, Tejus Ga...
IPPS
2007
IEEE
14 years 1 month ago
DejaVu: Transparent User-Level Checkpointing, Migration, and Recovery for Distributed Systems
In this paper, we present a new fault tolerance system called DejaVu for transparent and automatic checkpointing, migration, and recovery of parallel and distributed applications....
Joseph F. Ruscio, Michael A. Heffner, Srinidhi Var...
ICDCS
2012
IEEE
11 years 10 months ago
Combining Partial Redundancy and Checkpointing for HPC
Today’s largest High Performance Computing (HPC) systems exceed one Petaflops (10^15 floating point operations per second) and exascale systems are projected within seven years...
James Elliott, Kishor Kharbas, David Fiala, Frank ...
ISORC
2000
IEEE
13 years 12 months ago
Scheduling Solutions for Supporting Dependable Real-Time Applications
This paper deals with tolerance to timing faults in time-constrained systems. TAFT (Time Aware Fault-Tolerant) is a recently devised approach which applies tolerance to timing vio...
F. Sandrini, Felicita Di Giandomenico, Andrea Bond...
CCGRID
2009
IEEE
14 years 2 months ago
BLAST Application with Data-Aware Desktop Grid Middleware
There exists numerous Grid middleware to develop and execute programs on the computational Grid, but they still require intensive work from their users. BitDew is made to facili...
Haiwu He, Gilles Fedak, Bing Tang, Franck Cappello