As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those s...
Yanyong Zhang, Mark S. Squillante, Anand Sivasubra...
Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. Periodic application checkpointing is a commo...
As the scale is expanding, node failure becomes a commonplace feature of large-scale cluster systems. As an important part of cluster operating system software, job scheduling tak...
Linping Wu, Dan Meng, Jianfeng Zhan, Wang Lei, Bib...
In this paper we present an initial analysis of job failures in a large-scale data-intensive Grid. Based on three representative periods in production, we characterize the interar...
Hui Li, David L. Groep, Lex Wolters, Jeffrey Templ...
We propose a hybrid approach for scheduling real-time tasks on large-scale multicore platforms with hierarchical shared caches. In this approach, a multicore platform is partition...
John M. Calandrino, James H. Anderson, Dan P. Baum...