Sciweavers

359 search results - page 5 / 72
» A Framework for Experimental Validation and Performance Eval...
Sort
View
HCW
2000
IEEE
13 years 11 months ago
Evaluation of PAMS' Adaptive Management Services
Management of large-scale parallel and distributed applications is an extremely complex task due to factors such as centralized management architectures, lack of coordination and ...
Yoonhee Kim, Salim Hariri, Muhamad Djunaedi
IPPS
2007
IEEE
14 years 1 months ago
The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
To be able to fully exploit ever larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementati...
Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox,...
CCGRID
2010
IEEE
13 years 7 months ago
Selective Recovery from Failures in a Task Parallel Programming Model
Abstract--We present a fault tolerant task pool execution environment that is capable of performing fine-grain selective restart using a lightweight, distributed task completion tr...
James Dinan, Arjun Singri, P. Sadayappan, Sriram K...
ET
2007
101views more  ET 2007»
13 years 6 months ago
Towards Nanoelectronics Processor Architectures
In this paper, we focus on reliability, one of the most fundamental and important challenges, in the nanoelectronics environment. For a processor architecture based on the unreliab...
Wenjing Rao, Alex Orailoglu, Ramesh Karri
ICCD
2005
IEEE
159views Hardware» more  ICCD 2005»
14 years 9 days ago
Architectural-Level Fault Tolerant Computation in Nanoelectronic Processors
Nanoelectronic devices are expected to have extremely high and variable fault rates; thus future processor architectures based on these unreliable devices need to be built with fa...
Wenjing Rao, Alex Orailoglu, Ramesh Karri