InCheck: An Integrated Safe and Fast Recovery Scheme from Soft Errors
In a fault-tolerant system, a recovery routine should be invoked right after error detection to eliminate the effect of the error from the system. The simplest recovery strategy is global restart, in which the system is restored to its initial state and restarts its computations. Since the recovery latency of restarting a program from the beginning can be considerably large, this strategy is not appropriate for many systems, e.g., long-running, interactive, and real-time applications. In checkpoint/rollback schemes, a snapshot of the processor state (mainly the register file and memory), called a checkpoint, is periodically saved to safe storage, and program execution rolls back to the last checkpoint when an error occurs. In these schemes, the checkpointing frequency determines the trade-off between performance overhead and recovery latency. If a system needs fast recovery, checkpoints must be taken frequently, which imposes significant overhead on the system. Moreover, the checkpointing process itself should be error-free: some errors, known as latent errors, may occur before a checkpoint, cause wrong data to be preserved in it, and be detected only long after the checkpoint. In such cases, even restoring the program from the last checkpoint cannot remove the effect of the error from the computations.
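The trade-off between checkpointing frequency and recovery latency can be illustrated with a toy model. The sketch below is not taken from the paper: the function `run_with_checkpoints` and its parameters `steps`, `interval`, and `fault_at` are hypothetical names, every step is assumed to cost one unit of work, and the error is assumed to be detected immediately (so latent errors are not modeled).

```python
import copy

def run_with_checkpoints(steps, interval, fault_at=None):
    """Run `steps` unit-cost steps, saving a checkpoint every `interval` steps.

    If `fault_at` is given, one error is detected right after that step
    completes; execution then rolls back to the last saved checkpoint.
    Returns (final_state, checkpoints_taken, steps_reexecuted).
    """
    state = {"step": 0, "acc": 0}
    saved = copy.deepcopy(state)      # the initial state acts as checkpoint 0
    checkpoints = 0                   # proxy for performance overhead
    reexecuted = 0                    # proxy for recovery latency
    fault_pending = fault_at is not None

    while state["step"] < steps:
        # One unit of "useful" computation.
        state["acc"] += state["step"]
        state["step"] += 1

        if fault_pending and state["step"] == fault_at:
            # Error detected: discard work done since the last checkpoint.
            fault_pending = False
            reexecuted = state["step"] - saved["step"]
            state = copy.deepcopy(saved)
            continue

        if state["step"] % interval == 0:
            saved = copy.deepcopy(state)
            checkpoints += 1

    return state, checkpoints, reexecuted


# Frequent checkpoints: more snapshots taken, little work lost on rollback.
print(run_with_checkpoints(100, 10, fault_at=49)[1:])
# Sparse checkpoints: few snapshots taken, much more work to redo.
print(run_with_checkpoints(100, 50, fault_at=49)[1:])
```

In this model, choosing an `interval` larger than `steps` never saves a checkpoint after the start, which corresponds to the global restart strategy.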
To enable fast and safe recovery for critical, time-sensitive applications, we propose InCheck, an application-integrated recovery scheme. In contrast to external checkpointing libraries, which suffer from the latent-error problem and unacceptable overhead, InCheck performs lightweight checkpointing at basic-block granularity and provides safe recovery by ensuring that no error can propagate to the checkpoint storage. InCheck achieves safe recovery by running a diagnosis routine after each error detection rather than blindly restoring the program from the last checkpoint. Extensive fault-injection results demonstrate that InCheck-protected programs successfully recover from all injected errors and produce the expected outputs within the expected time.
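The idea of keeping errors out of the checkpoint storage can be sketched as a verify-before-commit loop. This is only an illustration under stated assumptions, not InCheck's actual mechanism, which the text above does not detail: here each "basic block" is a hypothetical pure Python function, errors are detected by executing the block twice and comparing the results, and recovery simply re-executes the block from the last committed state. The diagnosis routine is not modeled. The names `run_protected`, `blocks`, and `inject` are made up for this example.

```python
import copy

def run_protected(blocks, state, inject=None):
    """Execute `blocks` in order, committing state only after verification.

    blocks: list of pure functions mapping a state dict to a new state dict.
    inject: optional (block_index, corrupt_fn) pair that corrupts the first
            result of one block once, modeling a transient soft error.
    Returns (final_state, number_of_recoveries).
    """
    committed = copy.deepcopy(state)  # last verified state (the checkpoint)
    recoveries = 0
    i = 0
    while i < len(blocks):
        # Run the block on a private copy so the checkpoint stays untouched.
        primary = blocks[i](copy.deepcopy(committed))
        if inject is not None and inject[0] == i:
            primary = inject[1](primary)
            inject = None             # transient: the fault happens only once
        # Redundant execution used purely for error detection.
        shadow = blocks[i](copy.deepcopy(committed))

        if primary != shadow:
            # Mismatch: nothing is committed, so the checkpoint is still
            # clean and only this block needs to be re-executed.
            recoveries += 1
            continue

        committed = primary           # verified result becomes the checkpoint
        i += 1
    return committed, recoveries


def add_one(s):
    return {**s, "x": s["x"] + 1}

def double(s):
    return {**s, "x": s["x"] * 2}

def flip_bit(s):
    # Flip one bit of x to imitate a single-event upset.
    return {**s, "x": s["x"] ^ 4}

print(run_protected([add_one, double], {"x": 5}))
print(run_protected([add_one, double], {"x": 5}, inject=(1, flip_bit)))
```

Because a result is committed only after the comparison succeeds, a rollback in this sketch never re-executes more than one block, which is the intuition behind checkpointing at basic-block granularity.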
Link to the publication.