Heartbeat protocols are used by distributed programs to ensure that if a process in a program terminates or fails, then the remaining processes in the program terminate. We present a class of heartbeat protocols that tolerate message loss. In these protocols, a root process periodically sends a beat message to every other process then waits to receive a reply beat message from every other process. If the root process does not receive a reply (possibly due to message loss), the root process reduces by half the period for sending beat messages. We show that in practical situations, the parameters of these protocols can be chosen to achieve a good compromise between three contradictory objectives: reduce the rate of sending beat messages, reduce the detection delay, and still keep the probability of premature termination small.
Mohamed G. Gouda, Tommy M. McGuire