Fault-Tolerant Distributed Simulation



In traditional distributed simulation schemes, the entire simulation must be restarted if any of the participating logical processes (LPs) crashes. This is highly undesirable for long-running simulations. Some form of fault tolerance is required to minimize the wasted computation. In this paper, a rollback-based optimistic fault-tolerance scheme is integrated with an optimistic distributed simulation scheme. In rollback recovery schemes, checkpoints are periodically saved on stable storage. After a crash, these saved checkpoints are used to restart the computation. We make use of the novel insight that a failure can be modeled as a straggler event with the receive time equal to the virtual time of the last checkpoint saved on stable storage. This saves implementation effort and reduces overhead. We define stable global virtual time (SGVT) as the virtual time such that no state with a lower timestamp will ever be rolled back despite crash failures. A simple change is made in existing...
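
The central idea, that a crash can be handled by the same rollback path as a straggler event, can be illustrated with a small sketch. This is not the authors' implementation: the class `LogicalProcess`, its methods, the toy state (a running sum), and the tie-breaking rule in `rollback` are all assumptions made for illustration, and the sketch covers only the recovering LP, not the messages exchanged with other LPs.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalProcess:
    """Toy optimistic LP (hypothetical; state is a running sum of payloads)."""
    state: int = 0
    lvt: int = 0  # local virtual time
    # Volatile (timestamp, state) snapshots used for ordinary rollback.
    history: list = field(default_factory=lambda: [(0, 0)])
    # Last checkpoint written to "stable storage" (here just a field).
    stable: tuple = (0, 0)

    def rollback(self, recv_time):
        # Assumption: keep snapshots with timestamp <= recv_time,
        # then restore the latest one that remains.
        self.history = [(t, s) for (t, s) in self.history if t <= recv_time]
        self.lvt, self.state = self.history[-1]

    def process(self, recv_time, payload):
        # An event older than the local virtual time is a straggler.
        if recv_time < self.lvt:
            self.rollback(recv_time)
        self.state += payload
        self.lvt = recv_time
        self.history.append((self.lvt, self.state))

    def checkpoint(self):
        # Save the current state on stable storage.
        self.stable = (self.lvt, self.state)

    def crash_and_recover(self):
        # The crash destroys volatile snapshots; only the checkpoint survives.
        ckpt_time, _ = self.stable
        self.history = [self.stable]
        # The failure is treated as a straggler whose receive time is the
        # virtual time of the last stable checkpoint.
        self.rollback(ckpt_time)
        return ckpt_time


lp = LogicalProcess()
lp.process(5, 1)
lp.process(10, 2)
lp.checkpoint()          # stable checkpoint at virtual time 10
lp.process(15, 4)
lp.process(20, 8)
straggler_time = lp.crash_and_recover()
print(straggler_time, lp.lvt, lp.state)
```

In this sketch the recovery code reuses `rollback` unchanged, which is the kind of implementation saving the abstract alludes to. A full scheme would also need to reprocess rolled-back events and inform other LPs of the failure; both are omitted here.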

Om P. Damani, Vijay K. Garg
