As high performance clusters continue to grow in size, the mean time between failure shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challengi...
The combination of Grid technology and web services has produced an attractive platform for deploying distributed applications: Grid services, as represented by the Open Grid Serv...
Xianan Zhang, Dmitrii Zagorodnov, Matti A. Hiltune...
All-to-all broadcast is one of the common collective operations that involve dense communication between all processes in a parallel program. Previously, programmable Network Inte...
Cluster computing environments built from commodity hardware have provided a cost-effective solution for many scientific and high-performance applications. Likewise, middleware te...
Many of the modern networks used to interconnect nodes in cluster-based computing systems provide network interface cards (NICs) that offer programmable processors. Substantial re...
Adam Wagner, Hyun-Wook Jin, Dhabaleswar K. Panda, ...
Emerging infrastructure of computational grids composed of Clusters-of-Clusters (CoC) interlinked through high throughput channels promises unprecedented raw compute power for ter...
In multicluster systems, and more generally, in grids, jobs may require co-allocation, i.e., the simultaneous allocation of resources such as processors and input files in multipl...
Gang Scheduling and related techniques are widely believed to be necessary for efficientjob scheduling on distributed memory parallel computers. This is hecause they minimize cont...
The complexity and cost of isolating the root cause of system problems in large parallel computers generally scales with the size of the system. Syslog messages provide a primary ...