We have created ZapC, a novel system for transparent coordinated checkpoint-restart of distributed network applications on commodity clusters. ZapC provides a thin virtualization layer on top of the operating system that decouples a distributed application from dependencies on the cluster nodes on which it is executing. This decoupling enables ZapC to checkpoint an entire distributed application across all nodes in a coordinated manner such that it can be restarted from the checkpoint on a different set of cluster nodes at a later time. ZapC checkpoint-restart operations execute in parallel across different cluster nodes, providing faster checkpoint-restart performance. ZapC uniquely supports network state in a transport protocol independent manner, including correctly saving and restoring socket and protocol state for both TCP and UDP connections. We have implemented a ZapC Linux prototype and demonstrate that it provides low virtualization overhead and fast checkpointrestart times f...
Oren Laadan, Dan B. Phung, Jason Nieh