This paper deals with the monitoring and diagnosis of large discrete-event systems. The problem is to determine, online, all faults and states that explain the flow of observatio...
Efficient algorithms exist for fault detection and isolation of physical systems based on functional redundancy. In a qualitative approach, this redundancy can be captured by a tem...
Faults that occur in production systems are the most important faults to fix, but most production systems lack the debugging facilities present in development environments. TraceB...
Andrew Ayers, Richard Schooler, Chris Metcalf, Ana...
An important step in achieving robustness to run-time faults is the ability to detect and repair problems when they arise in a running system. Effective fault detection a...
Paulo Casanova, Bradley R. Schmerl, David Garlan, ...
Designing a distributed fault tolerance algorithm requires careful analysis of both fault models and diagnosis strategies. A system will fail if there are too many active faults, ...