New single-machine environments are emerging from abundant computation available through multiple cores and secure virtualization. In this paper, we describe the research challeng...
Constructing logical machines out of collections of physical machines is a well-known technique for improving the robustness and fault tolerance of distributed systems. We present...
Yair Amir, Brian A. Coan, Jonathan Kirsch, John La...
Fault tolerance is an important issue for large machines with tens or hundreds of thousands of processors. Checkpoint-based methods, currently used on most machines, roll back all ...
Virtualization provides the possibility of whole-machine migration and thus enables a new form of fault tolerance that is completely transparent to applications and operating syst...
This paper presents a software architecture for hardware fault tolerance based on loosely-synchronized, redundant virtual machines (LSRVM). LSRVM will provide high levels of relia...