A hardware fault tolerance scheme is described for large multicomputers that execute time-consuming, non-interactive applications. Error detection and recovery are performed mostly in software, with little hardware support. The scheme is based on simultaneous execution of identical copies of the application on two subnetworks of the system. Normal system operation is periodically suspended and the logical states of the two subnetworks are synchronized. Errors are detected by comparing the ``frozen'' synchronized states of the two subnetworks while those states are being saved as ``checkpoints'' for possible later use in error recovery. Algorithms for error detection and recovery using this scheme are discussed.
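
As a rough illustration of the scheme summarized above, the following Python sketch runs two copies of a deterministic computation, compares their states at fixed intervals, saves the agreed state as a checkpoint, and rolls both copies back to the last checkpoint on a mismatch. It is a single-process toy, not the algorithms of the paper: the names `run_duplicated`, `step`, `interval`, and the `fault` injection hook are all hypothetical, the two "subnetworks" are simply two in-memory states, and faults are assumed to be transient and never to corrupt the saved checkpoint.

```python
import copy


def run_duplicated(step, initial_state, total_steps, interval, fault=None):
    """Run two copies of `step`; compare and checkpoint every `interval` steps.

    `fault(replica, step_no, state)` is a hypothetical hook that returns a
    (possibly corrupted) state, imitating a transient hardware error.
    Returns (final_state, number_of_rollbacks).
    """
    # Two replicas stand in for the two subnetworks (an assumption of this sketch).
    replicas = [copy.deepcopy(initial_state), copy.deepcopy(initial_state)]
    checkpoint, checkpoint_step = copy.deepcopy(initial_state), 0
    done, rollbacks = 0, 0

    while done < total_steps:
        target = min(done + interval, total_steps)
        for n in range(done, target):
            for r in range(2):
                replicas[r] = step(replicas[r])
                if fault is not None:
                    replicas[r] = fault(r, n, replicas[r])

        # Both replicas are now "frozen" at the same logical point.
        if replicas[0] == replicas[1]:
            # States agree: save the state as the new checkpoint.
            checkpoint, checkpoint_step = copy.deepcopy(replicas[0]), target
            done = target
        else:
            # States differ: an error occurred since the last checkpoint,
            # so restore both replicas from it and re-execute.
            replicas = [copy.deepcopy(checkpoint), copy.deepcopy(checkpoint)]
            done = checkpoint_step
            rollbacks += 1

    return replicas[0], rollbacks
```

Comparing the states detects that the two copies disagree but not which one is faulty, which is why this sketch restores both from the checkpoint. A permanent fault would make it loop indefinitely, so a real system would need additional diagnosis that is outside the scope of this toy.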