This paper describes a fault detection mechanism that uses the error codes returned by stream sockets to locate process failures. Because these errors are generated automatically whenever a process communicates with a failed process, the mechanism incurs no failure-free overhead. For some types of faults, however, detection is only possible if the surviving processes use certain communication operations. To assess the coverage and latency of the proposed mechanism, faults were injected during the execution of parallel applications. Our results show that in most cases faults could be detected using only the errors from the socket layer. Depending on the type of fault injected, detection latency ranged from a few milliseconds to less than 9 minutes.
Nuno Neves, W. Kent Fuchs
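The general idea of treating socket-layer errors as failure notifications can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `send_or_detect` and `recv_or_detect` and the set `FAILURE_ERRNOS` are hypothetical, and the choice of error codes is an assumption made for illustration. The sketch only wraps ordinary send and receive calls, so it adds no work while the peer is alive, and it notices a failure only when the surviving process actually communicates.

```python
import errno
import socket

# Error codes treated here as evidence that the peer process has failed.
# This particular set is an illustrative assumption, not a list from the paper.
FAILURE_ERRNOS = {
    errno.EPIPE,
    errno.ECONNRESET,
    errno.ETIMEDOUT,
    errno.ECONNREFUSED,
    errno.EHOSTUNREACH,
}


def send_or_detect(sock, payload):
    """Send payload on a stream socket.

    Returns None on success, or the symbolic errno name (e.g. "EPIPE")
    when the send itself reveals that the peer is gone.
    """
    try:
        sock.sendall(payload)
        return None
    except OSError as exc:
        if exc.errno in FAILURE_ERRNOS:
            return errno.errorcode[exc.errno]
        raise  # unrelated error: let the caller handle it


def recv_or_detect(sock, nbytes):
    """Receive up to nbytes (nbytes must be > 0) from a stream socket.

    Returns (data, None) when data arrived, or (b"", reason) when the
    receive indicates that the peer closed the connection or failed.
    """
    try:
        data = sock.recv(nbytes)
    except OSError as exc:
        if exc.errno in FAILURE_ERRNOS:
            return b"", errno.errorcode[exc.errno]
        raise
    if data == b"":
        # On a stream socket, an empty read means the peer closed its end.
        return b"", "EOF"
    return data, None
```

A simple way to try the sketch is to create a connected pair with `socket.socketpair()`, close one end to simulate a failed process, and then call the helpers on the surviving end. Which indication appears (an end-of-file on receive or an error code on send) depends on the operation used, which is consistent with the observation that some faults are only detected when certain communication operations are performed.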