Fault tolerance is one of the key issues for large scale applications executed on high performance computing systems. In a cluster federation, clusters are gathered to provide hug...
—Parallel performance monitoring extends parallel measurement systems with infrastructure and interfaces for online performance data access, communication, and analysis. At the s...
Aroon Nataraj, Allen D. Malony, Allen Morris, Dori...
In this article we present the design choices and the evaluation of a batch scheduler for large clusters, named OAR. This batch scheduler is based upon an original design that emp...
Nicolas Capit, Georges Da Costa, Yiannis Georgiou,...
Advances in multiprocessor interconnect technologyare leading to high performance networks. However, software overheadsassociated with message passing are limiting the processors ...
Debashis Basak, Dhabaleswar K. Panda, Mohammad Ban...
In this paper, we introduce Systematic P2P Aided Cache Enhancement or SPACE, a new collaboration scheme among clients in a computer cluster of a high performance computing facility...
Mohammad Mursalin Akon, Mohammad Towhidul Islam, X...