Resilient Programming Models for One-Sided Communications



Checkpoint-restart is commonly used to provide resilience to fail-stop faults (e.g., node failures) for HPC applications. However, mean time to failure shortens as system size increases, and checkpoint-restart does not scale: on a large enough system it is no longer possible to checkpoint the entire system memory between failures.
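The scaling argument can be illustrated with a rough back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions: node failures are independent (so system mean time to failure is the node figure divided by the node count), the aggregate I/O bandwidth to stable storage is fixed, and all names and numbers (node_mttf_hours, memory_per_node_tb, io_bandwidth_tb_per_s) are hypothetical values chosen for the example rather than measurements of any real system.

```python
def system_mttf_hours(node_mttf_hours: float, num_nodes: int) -> float:
    """System mean time to failure, assuming independent node failures."""
    return node_mttf_hours / num_nodes


def checkpoint_hours(total_memory_tb: float, io_bandwidth_tb_per_s: float) -> float:
    """Time to write all system memory to stable storage, in hours."""
    return total_memory_tb / io_bandwidth_tb_per_s / 3600.0


def checkpoint_fits(node_mttf_hours: float, num_nodes: int,
                    memory_per_node_tb: float,
                    io_bandwidth_tb_per_s: float) -> bool:
    """True if a full-memory checkpoint completes within one system MTTF."""
    mttf = system_mttf_hours(node_mttf_hours, num_nodes)
    ckpt = checkpoint_hours(memory_per_node_tb * num_nodes,
                            io_bandwidth_tb_per_s)
    return ckpt < mttf


if __name__ == "__main__":
    # Hypothetical figures: 5-year node MTTF, 0.5 TB per node, 1 TB/s I/O.
    for nodes in (1_000, 1_000_000):
        print(nodes, checkpoint_fits(43800.0, nodes, 0.5, 1.0))
```

Under these assumed figures, a full checkpoint fits comfortably between failures at 1,000 nodes but not at 1,000,000 nodes, because checkpoint time grows with the node count while the time between failures shrinks with it.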

Alternative models such as MPI User-Level Failure Mitigation (ULFM) [1] and Resilient X10 [2] have not addressed one-sided communication, which creates particular challenges for maintaining correctness and progress in the presence of process failures.

This work could start with a baseline of either MPI-3 [3] or GASNet [4] and define control flow, update semantics and recovery operations for resilient operation in the presence of arbitrary process failures.








Updated: 10 February 2019 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing