Checkpointing and replaying is an attractive technique that has been used widely at the operating/runtime system level to provide fault tolerance. Applying such a technique at the...
Recently there has been renewed interest in building reliable servers that support continuous application operation. Besides maintaining system state consistent after a failure, o...
Management of large-scale parallel and distributed applications is an extremely complex task due to factors such as centralized management architectures, lack of coordination and ...
This paper discusses extensions to the Rover toolkit for constructing reliable mobile-aware applications. The extensions improve upon the existing failure model, which only addres...
The inherent redundancy and in-the-field reconfiguration capabilities of field programmable gate arrays (FPGAs) provide alternatives to integrated circuit redundancy-based fault r...
John Lach, William H. Mangione-Smith, Miodrag Potk...