On the extreme-scale computers of the future, faults will become increasingly common.
In these machines, the number of individual components grows without a compensating improvement in the reliability of each component. Achieving resilience is currently expensive: it inevitably requires redundancy and, therefore, more system resources and additional energy.
Traditional checkpointing techniques periodically collect data from all compute nodes and transfer it to stable backup storage, from which the computation can be restarted after a failure. However, this approach will be too expensive and too slow at extreme scale.
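To illustrate the idea, the following is a minimal, hypothetical sketch of application-level checkpoint/restart (not code from the speakers): a long-running loop periodically snapshots its state, so that after a simulated fault it can resume from the last checkpoint instead of restarting from scratch. All names here (`Checkpointer`, `run`, the `interval` and `fail_at` parameters) are illustrative assumptions.

```python
import copy

class Checkpointer:
    """Illustrative in-memory checkpointer: snapshot state every
    `interval` iterations so work since the last snapshot is all
    that is lost on a fault."""

    def __init__(self, interval):
        self.interval = interval
        self.snapshot = None

    def maybe_save(self, step, state):
        # Deep-copy so later in-place updates cannot corrupt the backup.
        if step % self.interval == 0:
            self.snapshot = (step, copy.deepcopy(state))

    def restore(self):
        # Returns (step, state) of the most recent checkpoint.
        return self.snapshot

def run(n_steps, ckpt, fail_at=None, start=0, state=None):
    # Toy computation: accumulate a running sum over the steps.
    state = state if state is not None else {"sum": 0}
    for step in range(start, n_steps):
        ckpt.maybe_save(step, state)
        if step == fail_at:
            raise RuntimeError("simulated node fault")
        state["sum"] += step
    return state

ckpt = Checkpointer(interval=10)
try:
    run(100, ckpt, fail_at=57)        # fault partway through
except RuntimeError:
    step, state = ckpt.restore()      # resume from step 50, not step 0
    result = run(100, ckpt, start=step, state=state)

assert result["sum"] == sum(range(100))
```

The cost of this scheme at scale is exactly the concern raised above: every checkpoint copies (and, in a real system, transfers and stores) the full state, which motivates storing only the minimal information needed for recovery.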
In this seminar, we will explore a complementary, two-pronged approach that exploits application-specific features together with framework support for resilience. Together, these techniques reduce the amount of redundancy required and speed up the recovery process.
Associate Professor Linda Stals will concentrate on the mathematical properties of the algorithm to determine the minimum amount of information that must be stored to recover from a fault.
Dr Josh Milthorpe will review support for resilience in the frameworks and programming models for high-performance computing, with the goal of providing low-overhead resilience with minimal programmer effort.