Using several distinct recovery procedures is one technique for ensuring high availability and fault tolerance in computer systems. This technique has been applied to telecommunications systems and usually relies on redundant hardware and special recovery software to restore the system after hardware and software failures. We propose a simple, practical analytical approach to evaluating the availability of systems with several recovery procedures, based on a new ‘segregated failures’ model. To illustrate the approach, we apply it to the availability evaluation of a Lucent Technologies Reliable Clustered Computing application. Detailed numerical results are provided, and the impact of various types of failures and coverage factors on downtime is analysed.
Sergiy A. Vilkomir, David Lorge Parnas, Veena B. M