The probability that a failure will occur before the end of the computation increases as the number of processors used in a high-performance computing application increases. For l...
We introduce Re-FUSE, a framework that provides support for restartable user-level file systems. Re-FUSE monitors the user-level file system and, on a crash, transparently restart...
Traditional agreement-based Byzantine fault-tolerant (BFT) systems process all requests on all replicas to ensure consistency. In addition to the overhead for BFT protocol and sta...
The importance of transient faults is predicted to grow due to current technology trends of increased scale of integration. One of the components that will be significantly affecte...
Secure, fault-tolerant distributed systems are difficult to build, to validate, and to operate. Conservative design for such systems dictates that their security and fault toleran...
Coverage, fault tolerance and power consumption constraints make optimal placement of mobile sensors or other mobile agents a hard problem. We have developed a model for describin...
Large-scale compute clusters continue to grow to ever-increasing proportions. However, as clusters and applications continue to grow, the Mean Time Between Failures (MTBF) has redu...
Fault-tolerant services typically make assumptions about the type and maximum number of faults that they can tolerate while providing their correctness guarantees; when such a fau...
Byung-Gon Chun, Petros Maniatis, Scott Shenker, Jo...
The ability to decompose a complex, long-running query into simpler queries that produce the same result is useful for many scenarios, such as admission control, resource manageme...
Nicolas Bruno, Vivek R. Narasayya, Ravishankar Ram...