Most of today‘s HPC systems employ a single head node for control, which represents a single point of failure as it interrupts an entire HPC system upon failure. Furthermore, it...
Kai Uhlemann, Christian Engelmann, Stephen L. Scot...
Continuously monitoring the link performance is important to network diagnosis. Recently, active probes sent between end systems are widely used to monitor the link performance. I...
Many performance problems observed in high end systems are actually caused by the runtime system and not the application code. Detecting these cases will require parallel performa...
Rashawn L. Knapp, Karen L. Karavanic, Douglas M. P...
Some recent Petri net-based approaches to fault diagnosis of distributed systems suggest to factor the problem into local diagnoses based on the unfoldings of local views of the sy...
— The problem of sensor activation in a controlled discrete event system is considered. Sensors are assumed to be costly and can be turned on/off during the operation of the syst...