Resilience to Soft Faults in Large-scale Scientific Simulations

Description

As scientific simulations move towards very large scale, 'soft errors' become a limiting issue. These arise from random bit flips in memory cells, on data paths in the memory hierarchy, and within the CPU. On very large systems, the sheer number of components pushes the mean time between such faults to well within the execution time of a simulation. Soft errors are silent (they generally do not raise exceptions), yet the corrupted values propagate through the data fields as the simulation evolves. On smaller systems, soft errors can still be problematic when the system is run at minimal power (e.g. to save energy), is built from cheap but less reliable components, or operates under harsh conditions. Detecting and recovering from such errors in large-scale computations is therefore a pressing problem. Using new mathematical techniques that are naturally fault tolerant, this project will explore general solutions to this problem. A number of scalable parallel applications will be studied, and machine learning techniques could be applied to detect and delineate the areas in the data fields that have been corrupted by soft faults. By taking advantage of redundancy in the data, those areas can then be 'repaired'. Further details on the approach are available on request. This is a joint research project between the Research School of Computer Science and the Mathematical Sciences Institute (MSI) at ANU. Collaborators at the MSI include Professors Markus Hegland and Stephen Roberts.
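
As a purely illustrative sketch (not part of the project materials), the following Python fragment injects a single bit flip into a smooth one-dimensional field, flags the corrupted entry with a simple median-filter outlier test, and repairs it from neighbouring values. The field, the threshold, and the median-based detector are stand-ins chosen for this example; they are not the detection or recovery methods the project will develop.

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    def flip_bit(x, bit):
        # Flip one bit in the IEEE-754 representation of a float64 value,
        # emulating a silent memory fault.
        as_int = np.float64(x).view(np.uint64)
        return (as_int ^ np.uint64(1 << bit)).view(np.float64)

    # A smooth 1-D "solution field" standing in for real simulation data.
    n = 200
    field = np.sin(np.linspace(0.0, 2.0 * np.pi, n))

    # Inject a silent fault: flip a high exponent bit of one entry.
    corrupted = field.copy()
    corrupted[73] = flip_bit(corrupted[73], 62)

    # Detect: compare each interior value with the median of a 5-point window.
    # The median is robust to a single corrupted entry, so only the faulty
    # point produces a large residual.
    windows = sliding_window_view(corrupted, 5)          # shape (n - 4, 5)
    local_median = np.median(windows, axis=1)
    residual = np.abs(corrupted[2:-2] - local_median)
    threshold = 10.0 * np.median(np.abs(np.diff(corrupted)))  # crude, field-scale cutoff
    suspect = np.where(residual > threshold)[0] + 2      # shift back to field indices

    # Repair: overwrite suspect entries with the local median, exploiting the
    # redundancy (here, smoothness) of the underlying field.
    repaired = corrupted.copy()
    repaired[suspect] = local_median[suspect - 2]

    print("flagged indices:    ", suspect)
    print("error before repair:", np.abs(corrupted - field).max())
    print("error after repair: ", np.abs(repaired - field).max())

Real solution fields are higher-dimensional and far less regular than this toy example, which is where the machine learning and redundancy-based techniques mentioned above come in.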

Goals

  • to study and quantify the effect of soft faults across a range of applications;
  • to determine methods for detecting and delineating areas in the data that have been corrupted by soft faults, and to determine to what extent these methods are application-specific;
  • to develop highly scalable parallel algorithms for the detection of and recovery from soft faults.

Requirements

An understanding of parallel computing concepts and parallel programming, and some experience with applied mathematics methods.

Background Literature

A. Geist and C. Engelmann. Development of naturally fault tolerant algorithms for computing on 100,000 processors. 2002.

J. Dongarra, P. Beckman, et al. International Exascale Software Project Roadmap 1.0. CS Technical Report UT-CS-10-654, University of Tennessee, 2010.

K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33(6):518–528, June 1984.

B. Harding, M. Hegland, J. Larson, and J. Southern. Scalable and fault tolerant computation with the sparse grid combination technique. ArXiv e-prints, April 2014.
