Replication is a well-established approach to increasing database availability. Many database replication protocols have been proposed for the crash-stop failure model, ...
Key issues to address in autonomic job recovery for cluster computing are recognizing job failure; understanding the failure sufficiently to know if and how to restart the job; an...
Charles Earl, Emilio Remolina, Jim Ong, John Brown
This paper describes the implementation of a processor-group membership protocol in an experimental real-time network. The protocol is appropriate for fault-tolerant distributed sy...
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expen...
George Candea, Shinichi Kawamoto, Yuichi Fujiki, G...