We present S, the first system to provide transparent, low-overhead application record-replay and the ability to go live from replayed execution. S i...
A major challenge facing grid applications is the appropriate handling of failures. In this paper we address the problem of making parallel Java applications based on Remote Method...
It is widely accepted that transient failures will appear more frequently in chips designed in the near future due to several factors such as the increased integration scale. On t...
This paper presents an approach for integrating fault-tolerance techniques into microprocessors by utilizing instruction redundancy as well as time redundancy. Smaller and smaller...
Data aggregation plays an important role in the design of scalable systems, allowing the determination of meaningful system-wide properties to direct the execution of distributed a...