Many state-of-the-art approaches on fault-tolerant system design make the simplifying assumption that all faults are detected within a certain time interval. However, based on a d...
Jia Huang, Kai Huang, Andreas Raabe, Christian Buc...
As computational clusters increase in size, their mean-time-to-failure reduces. Typically checkpointing is used to minimize the loss of computation. Most checkpointing techniques, ...
—The number of CPUs in chip multiprocessors is growing at the Moore’s Law rate, due to continued technology advances. However, new technologies pose serious reliability challen...
Many areas of science currently use computing resources as a important part of their research, and many research groups adopt cluster architecture to use them efficiently and mana...
Hyuck Han, Jai Wug Kim, Jongpil Lee, Youngjin Yu, ...
We describe and test a software approach to overcoming radiation-induced errors in spaceborne applications running on commercial off-the-shelf components. The approach uses checks...