IPPS
2005
IEEE

Current Practice and a Direction Forward in Checkpoint/Restart Implementations for Fault Tolerance

Checkpoint/restart is a general idea whose particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, current implementations operate at either user level or system level. User-level implementations are relatively easy to build and are portable, but they suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault-tolerant (and ultimately autonomic) large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level che...
Added 25 Jun 2010
Updated 25 Jun 2010
Type Conference
Year 2005
Where IPPS
Authors José Carlos Sancho, Fabrizio Petrini, Kei Davis, Roberto Gioiosa, Song Jiang