On future extreme-scale computers, faults will become increasingly common as the number of individual components grows without a compensating improvement in reliability. Achieving resilience is expensive, since it inevitably requires redundancy and therefore additional system resources and energy. Traditional checkpointing techniques regularly collect data from all compute nodes and transfer it to backup storage, but at extreme scale this approach will be too expensive and too slow.
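To make the cost of the traditional approach concrete, the following is a minimal, hypothetical sketch (all names and parameters are illustrative, not from any specific system) of coordinated checkpointing: every node's full state is copied to backup storage at a fixed interval, so both the checkpoint and the rollback scale with the total application data.

```python
import copy

def run_with_checkpoints(node_states, steps, interval, advance, fail_at=None):
    """Advance every node's state for `steps` steps, checkpointing all of it
    every `interval` steps. `advance` is the per-step update applied on each
    node; `fail_at` injects a simulated fault that wipes in-memory state."""
    # Coordinated checkpoint: the data of EVERY node is stored together.
    checkpoint = {"step": 0, "states": copy.deepcopy(node_states)}
    step = 0
    while step < steps:
        if fail_at is not None and step == fail_at:
            # Fault: all in-memory state is lost; roll every node back
            # to the last full checkpoint and redo the lost steps.
            node_states = copy.deepcopy(checkpoint["states"])
            step = checkpoint["step"]
            fail_at = None
            continue
        node_states = [advance(s) for s in node_states]
        step += 1
        if step % interval == 0:
            checkpoint = {"step": step, "states": copy.deepcopy(node_states)}
    return node_states
```

For example, with four nodes each counting upward, a fault at step 7 rolls all nodes back to the step-5 checkpoint, yet the final result is unchanged; the cost is that the entire state of every node was copied at each checkpoint.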
We will explore a two-pronged, complementary approach that exploits application-specific features and framework support for resilience, both to reduce the amount of redundancy and to speed up recovery. Stals will concentrate on the mathematical properties of the algorithm to determine the minimal amount of information that must be stored in order to recover from a fault. Milthorpe will review support for resilience in the frameworks and programming models used for high-performance computing, with the goal of providing low-overhead resilience with minimal programmer effort.
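As a hypothetical illustration of storing only the minimal information needed for recovery (a sketch in the spirit of algorithm-based fault tolerance, not the project's actual method), the example below keeps a single parity block, the elementwise sum over all data blocks, instead of a full copy of each block; any one lost block can then be reconstructed from the survivors and the parity.

```python
def make_parity(blocks):
    """Elementwise sum over all blocks: the only redundant data stored."""
    return [sum(vals) for vals in zip(*blocks)]

def recover(blocks, parity, lost):
    """Reconstruct the single block at index `lost` by subtracting the
    surviving blocks from the parity, elementwise."""
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    blocks[lost] = [p - sum(vals) for p, vals in zip(parity, zip(*survivors))]
    return blocks
```

With three blocks of data, the parity adds only one block of redundancy (rather than three, as a full copy would), yet losing any single block remains recoverable; this trade-off between stored information and recoverable failure patterns is exactly what a mathematical analysis of the algorithm can make precise.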