The probability that a failure will occur before the end of the computation increases as the number of processors used in a high-performance computing application increases. For l...
We describe the design and implementation of a clustering service for a high-performance, shared-disk file system. The service provides failure detection and recovery, reliable e...
We present reliability solutions for adaptable Network RAM systems running on general-purpose clusters. Network RAM allows nodes with over-committed memory to swap pages...
Benchmarks have historically played a key role in guiding the progress of computer science systems research and development, but have traditionally neglected the areas of availabi...