Co-Lab Seminar Series: Fault-Tolerant Algorithms and Frameworks for Extreme-Scale Computing

On future extreme-scale computers, faults will become increasingly common as the number of individual components grows without a compensating improvement in reliability. Achieving resilience is expensive, since it inevitably requires redundancy and therefore more system resources and energy. Traditional checkpointing techniques regularly collect data from all compute nodes and store it in backup memory, but this will be too expensive and too slow at extreme scale.

We will explore a complementary two-pronged approach that exploits application-specific features and framework support for resilience, both to reduce the amount of redundancy and to speed up the recovery process. Stals will concentrate on the mathematical properties of the algorithm to determine the minimal amount of information that must be stored in order to recover from a fault. Milthorpe will review support for resilience in frameworks and programming models for high-performance computing, with the goal of providing low-overhead resilience with minimal programmer effort.

Date & time

3–5pm 21 Feb 2020

Location

Room: Seminar Room, 1.33

Internal speakers

Dr Josh Milthorpe

Speakers

A/Prof Linda Stals

Contacts

02 6125 2394

Updated: 1 June 2019 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing