This paper describes Project Kittyhawk, an undertaking at IBM Research to explore the construction of a nextgeneration platform capable of hosting many simultaneous web-scale work...
Given the scale of massively parallel systems, occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in the...
The threat of soft error induced system failure in high performance computing systems has become more prominent, as we adopt ultra-deep submicron process technologies. In this pap...
In this paper, we define and study a new class of capacity planning models called MAP queueing networks. MAP queueing networks provide the first analytical methodology to describe ...
Software failures in server applications are a significant problem for preserving system availability. We present ASSURE, a system that introduces rescue points that recover softw...
Stelios Sidiroglou, Oren Laadan, Carlos Perez, Nic...