In traditional distributed simulation schemes, the entire simulation needs to be restarted if any of the participating logical processes (LPs) crashes. This is highly undesirable for long-running simulations. Some form of fault tolerance is required to minimize the wasted computation. In this paper, a rollback-based optimistic fault-tolerance scheme is integrated with an optimistic distributed simulation scheme. In rollback recovery schemes, checkpoints are periodically saved on stable storage. After a crash, these saved checkpoints are used to restart the computation. We make use of the novel insight that a failure can be modeled as a straggler event with the receive time equal to the virtual time of the last checkpoint saved on stable storage. This saves implementation effort and reduces overhead. We define the stable global virtual time (SGVT) as the virtual time such that no state with a lower timestamp will ever be rolled back despite crash failures. A simple change is made in existing...
Om P. Damani, Vijay K. Garg
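
The failure-as-straggler insight stated in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `LogicalProcess`, its methods, and the integer "state" are hypothetical names chosen for illustration, and the sketch assumes a single LP, omits anti-messages and re-execution of rolled-back events, and assumes no straggler is older than the last stable checkpoint. It only shows that recovering from a crash leaves the LP in the same state as an ordinary rollback to the virtual time of the last stable checkpoint.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalProcess:
    """Toy optimistic LP (hypothetical, for illustration only)."""
    lvt: int = 0      # local virtual time
    state: int = 0    # toy state: running sum of event payloads
    # Volatile (virtual_time, state) snapshots; lost when the LP crashes.
    history: list = field(default_factory=lambda: [(0, 0)])
    # Last checkpoint on stable storage; survives a crash.
    stable: tuple = (0, 0)

    def rollback(self, t: int) -> None:
        # Discard snapshots with virtual time greater than t and
        # restore the latest remaining one (assumes t >= oldest snapshot).
        while self.history[-1][0] > t:
            self.history.pop()
        self.lvt, self.state = self.history[-1]

    def process(self, recv_time: int, payload: int) -> None:
        # An event with a receive time in the LP's past is a straggler.
        if recv_time < self.lvt:
            self.rollback(recv_time)
        self.lvt = recv_time
        self.state += payload
        self.history.append((self.lvt, self.state))

    def checkpoint(self) -> None:
        # Save the current snapshot to (simulated) stable storage.
        self.stable = (self.lvt, self.state)

    def crash(self) -> None:
        # Volatile snapshots are gone; only the stable checkpoint remains.
        self.history = [self.stable]
        # Treat the failure as a straggler whose receive time is the
        # virtual time of the last stable checkpoint.
        self.rollback(self.stable[0])


def run_prefix() -> LogicalProcess:
    lp = LogicalProcess()
    lp.process(10, 1)
    lp.process(20, 2)
    lp.checkpoint()      # stable checkpoint at virtual time 20
    lp.process(30, 4)    # optimistic work past the checkpoint
    return lp


crashed = run_prefix()
crashed.crash()

straggled = run_prefix()
straggled.rollback(20)   # ordinary rollback to the checkpoint's virtual time

print((crashed.lvt, crashed.state) == (straggled.lvt, straggled.state))
```

Because the crash path reuses `rollback`, no separate recovery logic is needed in this sketch, which mirrors the abstract's claim of saved implementation effort.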