This paper describes and evaluates two algorithms for performing on-line failure recovery (data reconstruction) in redundant disk arrays. It presents an implementation of disk-ori...
Mark Holland, Garth A. Gibson, Daniel P. Siewiorek
In the research reported in this paper, transient faults were injected in the nodes and in the communication subsystem (by using software fault injection) of a commercial parallel...
This paper presents a benchmark for dependable systems. The benchmark consists of two metrics, number of catastrophic incidents and performance degradation, which are obtained by a...
Most fault-tolerant systems are designed to stop faulty programs before they write permanent data or communicate with other processes. This property (halt-on-failure) forms the co...
Programs fail mainly for two reasons: logic errors in the code, and exception failures. Exception failures can account for up to 2/3 of system crashes [6], hence are worthy of ser...