As high-performance clusters continue to grow in size, the mean time between failures shrinks. Thus, the issues of fault tolerance and reliability are becoming one of the challengi...
Over the last twenty years the interfaces for accessing persistent storage within a computer system have remained essentially unchanged. Simply put, seek, read and write have defi...
Xiangyong Ouyang, David W. Nellans, Robert Wipfel,...
In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not supp...
Resource discovery is an important process for finding suitable nodes that satisfy application requirements in large loosely-coupled distributed systems. Besides inter-node heter...
Markets of computing resources typically consist of a cluster (or a multi-cluster) and jobs that arrive over time and request computing resources in exchange for payment. In this p...
Sergei Shudler, Lior Amar, Amnon Barak, Ahuva Mu'a...