Application Resilience: Making Progress in Spite of Failure

Abstract: While measures such as raw compute performance and system capacity continue to be important factors for evaluating cluster performance, such issues as system reliability and application resilience have become increasingly important as cluster sizes rapidly grow. Although efforts to directly improve fault-tolerance are important, it is also essential to accept that application failures will inevitably occur and to ensure that progress is made despite these failures. Application monitoring frameworks are central to providing application resilience. As such, the central theme of this paper is to address the impact that application monitoring detection latency has on the overall system performance. We find that immediate fault detection is not necessary in order to obtain substantial improvement in performance. This conclusion is significant because it implies that less complex, highly portable, and predominately less expensive failure detection schemes would provide adequate...

William M. Jones, John T. Daly, Nathan DeBardeleben
