Root-cause analysis after a system failure or error is an important activity for determining the exact reasons for the failure. Most of the time, these error conditions cannot be reproduced, or it is not feasible to run the system again under exactly the same scenario. An execution trace log of the functions and components involved, recorded during the event, is therefore essential for root-cause analysis and debugging in a complex system. Source-code-level instrumentation for dynamic analysis provides an accurate execution trace log, but an instrumented system is difficult to use in production environments because of performance and system stability issues. In a distributed system, intercepted network messages can be analyzed to identify interactions between the various components of the system. However, messages captured on the network alone do not provide complete information, because messages between components on the same host do not appear on the network. We present a new idea to construct interaction information am...
Atul Kumar, Anil R. Nair