This work addresses the problem of diagnosing configuration errors that cause a system to function incorrectly. For example, a change to the local firewall policy could cause a network-based application to malfunction. Our approach is based on searching across time for the instant the system transitioned into a failed state. Based on this information, a troubleshooter or administrator can deduce the cause of failure by comparing system state before and after the failure. We present the Chronus tool, which automates the task of searching for a failure-inducing state change. Chronus takes as input a user-provided software probe, which differentiates between working and non-working states. Chronus performs "time travel" by booting a virtual machine off the system's disk state as it existed at some point in the past. By using binary search, Chronus can find the fault point with effort that grows logarithmically with log size. We demonstrate that Chronus can diagnose a range...
Andrew Whitaker, Richard S. Cox, Steven D. Gribble