As the number of devices available per chip continues to increase, the computational potential of future computer architectures grows likewise. While this is a clear benefit for f...
Scalability and reliability are inseparable in high-performance computing. Fault isolation through hardware is a popular means of providing reliability. Unfortunately, such isolat...
Service-orientation has been proposed as a way of facilitating the development and integration of increasingly complex and heterogeneous system components. However, there are many...
This paper describes a fault detection mechanism that uses the error codes returned by stream sockets to locate process failures. Since these errors are generated automaticall...
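The general idea of inferring a process failure from stream-socket error codes can be sketched as follows. This is a minimal illustration in Python, not the mechanism of the paper summarized above: the function name `probe_peer`, the `FAILURE_ERRNOS` set, and the `b"ping"` payload are all hypothetical choices made for this example, and which error code the operating system reports for a dead peer varies by platform and socket type.

```python
import errno
import socket

# Errno values treated here as evidence that the peer process is gone.
# This set is an assumption made for illustration only.
FAILURE_ERRNOS = {
    errno.EPIPE,         # write to a connection whose peer has closed
    errno.ECONNRESET,    # peer reset the connection
    errno.ECONNREFUSED,  # nothing is listening on the peer address
    errno.ETIMEDOUT,     # transport-level timeout
}


def probe_peer(sock, payload=b"ping"):
    """Return (alive, reason) based on the outcome of ordinary socket calls.

    No separate heartbeat protocol is used: the verdict comes from the
    error code (or end-of-file) that the socket layer reports.
    """
    try:
        sock.sendall(payload)
        data = sock.recv(1024)
    except OSError as exc:
        if exc.errno in FAILURE_ERRNOS:
            # Map the numeric code to its symbolic name, e.g. "EPIPE".
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        raise  # unrelated error: let the caller decide
    if data == b"":
        # An orderly shutdown shows up as end-of-file, not as an error.
        return False, "EOF"
    return True, "OK"


if __name__ == "__main__":
    # A connected pair of local stream sockets stands in for two processes.
    local, remote = socket.socketpair()
    remote.sendall(b"pong")      # the peer is alive and has replied
    print(probe_peer(local))
    remote.close()               # simulate the peer process dying
    print(probe_peer(local))
    local.close()
```

The sketch relies on the fact that these errors surface as a side effect of normal communication, so the caller pays no extra cost while the peer is healthy. A real detector would also need to handle blocking reads (for example with `sock.settimeout`) and distinguish a failed process from a failed network link, which this example does not attempt.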
Large-scale compute clusters continue to grow in size. However, as clusters and applications grow, the Mean Time Between Failures (MTBF) has redu...