This paper addresses the problem of scheduling concurrent jobs on clusters where application data is stored on the computing nodes. This setting, in which scheduling computations ...
Michael Isard, Vijayan Prabhakaran, Jon Currey, Ud...
In this paper we present a cost-effective, high bandwidth server I/O network architecture, named PaScal (Parallel and Scalable). We use the PaScal server I/O network to support da...
Scientists increasingly rely on the execution of workflows in grids to obtain results from complex mixtures of applications. However, the inherently dynamic nature of grid workflo...
There is a growing need for systems that can monitor and analyze application performance data automatically in order to deliver reliable and sustained performance to applications....
Lingyun Yang, Jennifer M. Schopf, Catalin Dumitres...
To improve the whole dependability of large-scale cluster systems, an online fault detection mechanism is proposed in this paper. This mechanism can detect the fault in time befor...