In this paper, we focus on reliability, one of the most fundamental and important challenges in the nanoelectronics environment. For a processor architecture based on the unreliab...
With respect to scalability and arbitrary topologies of the underlying networks in multiprogramming and multithreaded environments, fault tolerance in acknowledged ATAB and concurren...
Yuzhong Sun, Paul Y. S. Cheung, Xiaola Lin, Keqin ...
This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of sh...
The number of processors embedded in high performance computing platforms is growing daily to solve larger and more complex problems. The logical network topologies must also suppo...
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in ScaLAPACK matrix ...