Left unchecked, the fundamental drive to increase peak performance using tens of thousands of power-hungry components will lead to intolerable operating costs and failure rates. R...
Large-scale distributed computing infrastructure is characterized by a large number of nodes, poor communication performance, and continuously varying resources that are not...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
Some of the most challenging applications to parallelize scalably are the ones that present a relatively small amount of computation per iteration. Multiple interacting performanc...
Tracing and performance analysis tools are an important component in the development of high-performance applications. Tracing parallel programs with current tracing tools, howeve...