A common debugging strategy involves re-executing a program (on a given input) over and over, each time gaining more information about bugs. Such techniques can fail on message-passing parallel programs. Because of variations in message latencies and process scheduling, different runs on the given input may produce different results. This non-repeatability is a serious debugging problem, since an execution cannot always be reproduced to track down bugs. This paper presents a technique for tracing and replaying message-passing programs for debugging. Our technique is optimal in the common case and has good performance in the worst case. By making run-time tracing decisions, we trace only a fraction of the total number of messages, a reduction of two orders of magnitude over traditional techniques, which trace every message. Experiments indicate that often only 1% of the messages need to be traced. These traces are sufficient to provide replay, allowing an execution to be reproduced any n...
Robert H. B. Netzer, Barton P. Miller
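
The record-and-replay idea summarized in the abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm: it shows only the traditional baseline the abstract compares against, in which the delivery order of every message is traced, and it models nondeterministic arrival with a seeded random scheduler. The function `run`, its parameters, and the sender names are hypothetical and chosen for illustration.

```python
import random
from collections import deque


def run(messages, seed=None, trace=None):
    """Simulate one receiver draining messages from several senders.

    messages: dict mapping sender name -> list of payloads (FIFO per sender).
    seed:     seeds the simulated scheduler (stands in for latency variation).
    trace:    optional list of sender names; if given, delivery follows it.
    Returns (delivered, recorded), where recorded is the delivery-order trace.
    """
    queues = {sender: deque(payloads) for sender, payloads in messages.items()}
    rng = random.Random(seed)
    replay = deque(trace) if trace is not None else None
    delivered, recorded = [], []

    while any(queues.values()):
        if replay is not None:
            # Replay mode: force the sender chosen in the recorded run.
            sender = replay.popleft()
        else:
            # Record mode: any sender with a pending message may arrive next.
            ready = sorted(s for s, q in queues.items() if q)
            sender = rng.choice(ready)
        delivered.append((sender, queues[sender].popleft()))
        recorded.append(sender)  # baseline: trace every delivery

    return delivered, recorded


if __name__ == "__main__":
    msgs = {"p1": [1, 2], "p2": ["a", "b"], "p3": ["x"]}
    first, trace = run(msgs, seed=1)
    # Replaying with the trace reproduces the run, whatever the scheduler seed.
    again, _ = run(msgs, seed=999, trace=trace)
    print(first == again)
```

In this baseline the trace holds one entry per message. The technique the abstract describes instead decides at run time which messages need to be traced, so its traces would be much shorter than the one recorded here.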