Making 3D Scientific Applications Fault Tolerant


At the heart of most scientific applications is a PDE solver, which evolves the solution of a set of Partial Differential Equations timestep by timestep. As scientific simulations move towards very large scale, 'soft errors' become a limiting issue. These arise from random bit flips in memory cells, on data paths in the memory hierarchy, and within the CPU. On very large systems, the sheer number of components raises the fault rate until the expected time between faults falls well within the execution time of a simulation. Soft errors are silent (they generally do not cause exceptions), but the errors they introduce propagate through the data fields as the simulation evolves.
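To make the notion of a silent bit flip concrete, the sketch below flips one bit in the 64-bit representation of a double. The helper name `flip_bit` is hypothetical and serves only as an illustration; it assumes an IEEE 754 binary64 `double`, which is the case on common platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Flip a single bit of x, assuming an IEEE 754 binary64 double:
// bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
// (Hypothetical helper, for illustration only.)
double flip_bit(double x, unsigned bit) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch expects a 64-bit double");
    assert(bit < 64);
    std::uint64_t raw;
    std::memcpy(&raw, &x, sizeof raw);   // well-defined way to read the bits
    raw ^= (std::uint64_t{1} << bit);    // the simulated soft error
    std::memcpy(&x, &raw, sizeof x);
    return x;
}
```

Under these assumptions, flipping bit 0 of 1.0 changes the value by about 2.2e-16, flipping bit 63 turns it into -1.0, and flipping bit 62 turns it into infinity. In the default floating-point environment none of these raises an exception, which is why such errors can go unnoticed.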

One of the simplest and most common PDE solvers is for advection, which can be used to model, for example, the transport of air, water and waves. 3D versions of this problem are by far the most important use case.
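As a point of reference, a minimal 1D advection step might look like the sketch below. It assumes the equation u_t + a u_x = 0 with a > 0, a periodic grid, and the first-order upwind scheme; the function name `upwind_step` and the choice of scheme are assumptions made for illustration, not taken from the project's code base.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One first-order upwind step for u_t + a u_x = 0 (a > 0) on a periodic grid.
// c = a*dt/dx is the Courant number; the scheme is stable for 0 <= c <= 1.
std::vector<double> upwind_step(const std::vector<double>& u, double c) {
    assert(!u.empty() && c >= 0.0 && c <= 1.0);
    const std::size_t n = u.size();
    std::vector<double> next(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t left = (i + n - 1) % n;   // periodic left neighbour
        next[i] = (1.0 - c) * u[i] + c * u[left];   // weighted average of 2 points
    }
    return next;
}
```

Each new value depends only on a small fixed neighborhood of old values (here two points), which is what makes this a stencil computation. With c = 1 the profile shifts by exactly one cell per step.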

Robust stencils are a general technique for making (stencil-type) PDE solvers tolerant to soft faults. Each updated point in the simulation is computed in such a way that any corrupted point in its neighborhood is avoided. The technique has very recently been demonstrated on 1D and 2D advection and other solvers, and has proved competitive against established methods such as Triple Modular Redundancy (TMR). However, no robust stencils have yet been developed for 3D solvers.
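The general idea of avoiding a corrupted neighbor can be illustrated in 1D as follows. This is a simplified sketch and not the stencils derived in the paper listed under Background Literature. It assumes a Courant number c in [0, 1], a periodic grid, and three candidate updates obtained by linear interpolation at the departure point x_i - c*dx, each using a disjoint pair of grid points; the median of the three is taken as the result. All names (`wrap_at`, `median3`, `robust_update`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Periodic access: value at index i + offset, wrapped into [0, n).
double wrap_at(const std::vector<double>& u, std::size_t i, long offset) {
    const long n = static_cast<long>(u.size());
    const long j = ((static_cast<long>(i) + offset) % n + n) % n;
    return u[static_cast<std::size_t>(j)];
}

// Median of three values (NaN inputs are not handled in this sketch).
double median3(double a, double b, double c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Updated value at point i from three candidate stencils with disjoint
// supports {i-1, i}, {i-2, i+1} and {i-3, i+2}. Each candidate linearly
// interpolates the old field at position i - c (in grid units).
double robust_update(const std::vector<double>& u, std::size_t i, double c) {
    assert(u.size() >= 6 && i < u.size() && c >= 0.0 && c <= 1.0);
    const double s1 = c * wrap_at(u, i, -1) + (1.0 - c) * wrap_at(u, i, 0);
    const double s2 = ((1.0 + c) * wrap_at(u, i, -2)
                     + (2.0 - c) * wrap_at(u, i, 1)) / 3.0;
    const double s3 = ((2.0 + c) * wrap_at(u, i, -3)
                     + (3.0 - c) * wrap_at(u, i, 2)) / 5.0;
    // A single corrupted point lies in at most one support, so at least two
    // candidates are clean and the median is bounded by them.
    return median3(s1, s2, s3);
}
```

Because the three supports are disjoint, one corrupted grid value can spoil only one candidate. The wider candidates are less accurate than the narrow one, and the stability of the combined scheme is not analysed here; a practical robust stencil has to address both.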


The goal of this project is to derive 3D robust stencils for advection, implement them, and evaluate them against competing methods (e.g. TMR). An existing code base for a (normal) 2D/3D advection solver is available. The derivation involves taking a 'cross product' of 1D stencils; the procedure uses straightforward algebraic techniques but is complex. If successful, the work should be publishable.
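For comparison, the TMR baseline can be sketched as an element-wise majority vote over three replicas of the same field. The sketch assumes the three replicas are computed deterministically, so that fault-free values agree bit for bit; the function name `tmr_vote` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise majority vote over three replicas a, b, c of the same field.
// Writes the voted field to out and returns the number of points at which
// no two replicas agreed (these cannot be repaired by voting).
// Note: == treats NaN as unequal to everything, including itself.
std::size_t tmr_vote(const std::vector<double>& a,
                     const std::vector<double>& b,
                     const std::vector<double>& c,
                     std::vector<double>& out) {
    assert(a.size() == b.size() && b.size() == c.size());
    out.resize(a.size());
    std::size_t unresolved = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] == b[i] || a[i] == c[i]) {
            out[i] = a[i];            // a agrees with at least one other replica
        } else if (b[i] == c[i]) {
            out[i] = b[i];            // a is the odd one out
        } else {
            out[i] = a[i];            // no majority: keep a, but report it
            ++unresolved;
        }
    }
    return unresolved;
}
```

TMR triples both the computation and the memory footprint, which is the cost that any alternative scheme is measured against.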


Strong programming ability in C/C++ and a strong aptitude in basic mathematics.

Background Literature

Peter Strazdins, Brendan Harding, Chung Lee, Jackson R. Mayo, Jaideep Ray, and Robert C. Armstrong. A Robust Technique to Make a 2D Advection Solver Tolerant to Soft Faults, Procedia Computer Science, Volume 80, 2016, Pages 1917--1926 (International Conference on Computational Science, San Diego, Jun 2016).



  • exascale computing; fault-tolerance; partial differential equations; robust stencils; advection equation; parallel computing; resilient computing

Updated:  10 February 2019/Responsible Officer:  Dean, CECS/Page Contact:  CECS Marketing