nZDC: A Compiler Technique for Near Zero Silent Data Corruption
Researcher: Moslem Didehban
Aggressive scaling of transistor feature sizes (10 nm to 7 nm) and the high integration density (billions of transistors) of modern microprocessors have made hardware components prone to different types of faults. Transient faults, or soft errors, are the main threat for nanometer-scale transistors. They are typically caused by high-energy particle strikes, which can unexpectedly change the logical value of a circuit and thereby introduce incorrect computation and/or alter the timing behavior of the system.
Redundancy is the key idea for detecting the effects of soft errors. State-of-the-art software-level redundancy techniques such as SWIFT and Shoestring replicate the computational instructions of a program and check the redundantly computed results before critical instructions, i.e., memory and control-flow instructions. Although these schemes can detect errors that reach the operands of such instructions, they cannot cover the impact of soft errors on the execution of the critical instructions themselves. Existing work suffers from this significant limitation because duplicating critical instructions is challenging. For instance, simply duplicating a memory write instruction does not add coverage, and replicating a branch instruction can be meaningless.
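The duplicate-and-check idea described above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: real schemes operate on machine instructions and registers, and the names `checked_store`, `scaled_sum`, and `flip_bit_in_master` are hypothetical, introduced here for illustration.

```python
class FaultDetected(Exception):
    """Raised when the master and shadow copies of a value disagree."""


def checked_store(memory, addr, addr_shadow, value, value_shadow):
    # Compare the redundantly computed operands *before* the critical
    # instruction, as duplication-based schemes do.
    if addr != addr_shadow or value != value_shadow:
        raise FaultDetected("operand mismatch before store")
    # The store itself executes once; a fault during this write
    # would not be caught by the check above.
    memory[addr] = value


def scaled_sum(memory, a, b, out_addr, flip_bit_in_master=None):
    # Master computation.
    total = a + b
    if flip_bit_in_master is not None:
        # Hypothetical injected soft error: flip one bit of the master value.
        total ^= 1 << flip_bit_in_master
    scaled = total * 2

    # Shadow (duplicated) computation on independent variables.
    total_shadow = a + b
    scaled_shadow = total_shadow * 2

    checked_store(memory, out_addr, out_addr, scaled, scaled_shadow)
    return memory[out_addr]
```

A bit flip in the master computation is caught because the shadow copy disagrees at the check. A fault that strikes while the store itself executes happens after the check and goes unnoticed, which is the coverage gap the paragraph above describes.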
We present nZDC, a comprehensive software-only error detection scheme that can detect the impact of soft errors on the execution of all program instructions across various hardware components. nZDC duplicates program instructions wherever it can, and for instructions where simple duplication is not useful, it adopts novel checking mechanisms. nZDC is designed to provide full coverage and to guarantee error-free output, a claim that no prior work can make. We implemented nZDC as a set of back-end compiler passes in the LLVM compiler infrastructure for the ARMv8 instruction set architecture. We performed extensive fault-injection experiments on a Cortex-A53-like microprocessor simulated in gem5 and found that nZDC detected the manifestation of all injected soft errors; no wrong or late output was observed.
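This summary does not spell out the checking mechanisms themselves. As one hypothetical illustration of how a memory write could be verified without duplicating it, the sketch below reads the location back through the shadow address and compares it with the shadow value. The function names and the fault model (`misdirected_store`) are assumptions made for this example, not nZDC's confirmed design.

```python
class FaultDetected(Exception):
    """Raised when a stored value does not match its shadow copy."""


def plain_store(memory, addr, value):
    # Fault-free store.
    memory[addr] = value


def misdirected_store(memory, addr, value):
    # Hypothetical fault model: the write lands one slot away from its target.
    memory[addr + 1] = value


def store_and_verify(memory, addr, value, addr_shadow, value_shadow,
                     store=plain_store):
    store(memory, addr, value)        # the single, real memory write
    loaded = memory.get(addr_shadow)  # read back through the shadow address
    if loaded != value_shadow:        # compare against the shadow value
        raise FaultDetected(f"store to address {addr_shadow} not verified")
```

Because the check runs after the write and relies only on the shadow copies, it can expose a fault in the store's value or address that a pre-store operand comparison would miss.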
Link to the publication.