Clusters and applications continue to grow in size while their mean time between failures (MTBF) is getting smaller. Checkpoint/Restart is becoming increasingly important for lar...
A crash dump, or core dump, is the typical way to save a memory image on a system crash for future offline debugging and analysis. However, for typical server machines with likely abund...
With applications becoming larger and the load on high-performance systems increasing, it is important to tackle the I/O bottleneck problem from several angles. It is not only ess...
Murali Vilayannur, Mahmut T. Kandemir, Anand Sivas...
Tools for parallel systems today range from specification through debugging to performance analysis and more. Typically, they help the programmers of parallel algorithms from the ea...
Parallel algorithm designers need computational models that take first-order system costs into account but are also simple enough to use in practice. This paper introduces the L...