Software systems are known to suffer from outages due to transient errors. Recently, the phenomenon of “software aging”, in which the state of a software system degrades with time, has been reported. To counteract this phenomenon, a proactive approach to fault management, called “software rejuvenation”, has been proposed. This essentially involves gracefully terminating an application or a system and restarting it in a clean internal state. In this paper, we discuss stochastic models that evaluate the effectiveness of proactive fault management in operational software systems and determine optimal times to perform rejuvenation under different scenarios. The latter part of the paper deals with measurement-based methodologies to detect software aging and estimate its effect on various system resources. Models are constructed using workload and resource usage data collected from the UNIX operating system over a period of time. The measurement-based models are intended to help d...
Kishor S. Trivedi, Kalyanaraman Vaidyanathan, Kate