Faults that occur in production systems are the most important faults to fix, but most production systems lack the debugging facilities present in development environments. TraceBack provides debugging information for production systems by providing execution history data about program problems (such as crashes, hangs, and exceptions). TraceBack supports features commonly found in production environments such as multiple threads, dynamically loaded modules, multiple source languages (e.g., Java applications running with JNI modules written in C++), and distributed execution across multiple computers. TraceBack supports first fault diagnosis—discovering what went wrong the first time a fault is encountered. The user can see how the program reached the fault state without having to re-run the computation; in effect enabling a limited form of a debugger in production code. TraceBack uses static, binary program analysis to inject lowoverhead runtime instrumentation at control-flow block...