Sciweavers

SC
2015
ACM

Infrastructure for In Situ System Monitoring and Application Data Analysis

We present an architecture for high-performance computers that integrates in situ analysis of hardware and system monitoring data with application-specific data to reduce application runtimes and improve overall platform utilization. Large-scale high-performance computing systems typically use monitoring as a tool unrelated to application execution. Monitoring data flows from sampling points to a centralized off-system machine for storage and post-processing when root-cause analysis is required. Along the way, it may also be used for instantaneous threshold-based error detection. Applications can know their application state and possibly allocated resource state, but typically, they have no insight into globally shared resource state that may affect their execution. By analyzing performance data in situ rather than off-line, we enable applications to make real-time decisions about their resource utilization. We address the particular case of in situ network congestion analysis an...
Jim M. Brandt, Karen D. Devine, Ann C. Gentile
Added 17 Apr 2016
Updated 17 Apr 2016
Type Journal
Year 2015
Where SC