A Unit Test Toolchain for Asynchronous Many-Task Runtime Systems

Description

Asynchronous Many-Task (AMT) programming languages are the future for High Performance Computing (HPC). Conventional languages over-specify a computation in the sense that they impose an execution order (e.g. the iterations of a loop) that is not inherent in the problem. Most concurrent and parallel languages also over-specify a computation, by imposing particular parallel execution orders, so other opportunities for parallelism cannot be exploited. Furthermore, it is difficult to exploit parallelism that exists at several different levels of a problem and/or is irregular.

In an AMT language (or programming model), the programmer defines tasks and merely has to express the control and data relationships (constraints) between them. This takes a massive burden off the programmer. The runtime system can then extract as much or as little parallelism as appropriate, while preserving these constraints. The runtime system can also take care of details of the underlying target computer architecture, whether it be multiple cores on a single socket (CPU chip), many cores within an accelerator, multiple sockets, multiple nodes connected by a network, or any combination of these. Thus, AMT programming models offer significant portability advantages over the traditional models. The traditional models are primarily OpenMP, which can only (easily) express parallelism across the cores of a node, and the Message Passing Interface (MPI), which targets multiple cores distributed over a network.
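To make the idea concrete, the following is a minimal Python sketch of a task graph in which the programmer states only the tasks and their data dependencies, and a small scheduler decides what can run concurrently. It is an illustration under stated assumptions, not the API of any particular AMT runtime: `run_task_graph`, `tasks`, and `deps` are hypothetical names, and a thread pool stands in for a real runtime system.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def run_task_graph(tasks, deps, max_workers=4):
    """Run every task once, honouring the declared dependencies.

    tasks: dict mapping a task name to a callable.
    deps:  dict mapping a task name to the ordered list of task names
           whose results are passed to it as positional arguments.
    (Both structures are hypothetical, chosen for this sketch.)
    """
    results = {}          # finished task name -> result
    pending = set(tasks)  # tasks not yet submitted
    running = {}          # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # A task is ready once all of its dependencies have results.
            ready = [n for n in pending
                     if all(d in results for d in deps.get(n, []))]
            for name in ready:
                args = [results[d] for d in deps.get(name, [])]
                running[pool.submit(tasks[name], *args)] = name
                pending.remove(name)
            if not running:
                raise ValueError("dependency cycle or unknown dependency")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                results[running.pop(future)] = future.result()
    return results


# "a" and "b" are independent, so the scheduler is free to run them in
# parallel; "sum" and "square" must wait for their inputs.
tasks = {
    "a": lambda: 2,
    "b": lambda: 3,
    "sum": lambda x, y: x + y,
    "square": lambda s: s * s,
}
deps = {"sum": ["a", "b"], "square": ["sum"]}
```

Calling `run_task_graph(tasks, deps)` returns a dictionary of results keyed by task name. The program never says which tasks run in parallel or on which worker; that choice is left to the scheduler, which is the separation of concerns described above.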

AMT programming models also have the advantage that they can more easily tolerate faults, for example DRAM bit-flips or processor failure. This is because the components of the computation (tasks and the data they operate on) are strongly encapsulated. Tasks can be re-run on another process if the process they were running on fails. Tasks can also be re-run to detect bit-flip errors, by comparing the results of the runs.
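Both recovery strategies can be sketched in a few lines of Python, assuming each task is a pure function of its inputs so that re-running it is safe. The helper names `run_with_retry` and `run_replicated` are hypothetical, and a raised exception stands in for the failure of the process running the task.

```python
def run_with_retry(task, args, max_attempts=3):
    """Re-run a task after a failure, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(*args)
        except Exception as err:  # stands in for a failed process
            last_error = err
    raise RuntimeError(f"task failed {max_attempts} times") from last_error


def run_replicated(task, args):
    """Run a task twice and compare the results to detect silent errors."""
    first = task(*args)
    second = task(*args)
    if first != second:
        raise RuntimeError("replica results differ; possible data corruption")
    return first


# Hypothetical flaky task: fails on its first call, succeeds afterwards.
calls = {"count": 0}


def flaky_sum(values):
    calls["count"] += 1
    if calls["count"] == 1:
        raise OSError("simulated process failure")
    return sum(values)
```

Here `run_with_retry(flaky_sum, ([1, 2, 3],))` succeeds on the second attempt. A real runtime would re-schedule the task on a different process or node rather than in the same one; this sketch only shows why encapsulated tasks make that possible.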

This project will develop infrastructure for AMT runtimes, so that they will be ready for current and future HPC systems, at both the low end (e.g. a single-socket server) and the high end (massively parallel, multi-level supercomputer systems).
 

Goals

Unit testing is a critical element of software engineering that is vastly underutilized in most high performance computing (HPC) development environments. As many scientific applications make the transition from bulk-synchronous, MPI-like programming models to asynchronous many-task (AMT) runtime systems, the need for thorough unit testing has never been greater. This project aims to support this need by providing a test-driven development environment and toolchain that matches the way that domain scientists develop their ideas into software: by providing an incremental, task-centric unit testing framework that spans the development spectrum from proof-of-concept to production-level implementation.

The idea for this project stems from the fact that, despite a significant difference in function, the requirements for creating a “task” and for creating a “unit” are very similar from the user’s perspective. In particular, the emphasis on self-containment, clear pre- and post-state specification, relatively minimal size and complexity, and invariance with respect to other independent tasks/units are key aspects of both task and unit development. Because of this fundamental connection, it makes sense to design a simple toolchain that leverages tasks as unit tests with minimal extra effort on the user’s part.
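As a rough illustration of this connection, the sketch below wraps a task that carries explicit pre- and post-state checks into a standard Python `unittest` test case. It is only one possible shape for such a toolchain: the `Task` class, its fields, and `make_test_case` are hypothetical names invented for this example and do not come from Charm++, Legion, or any existing tool.

```python
import copy
import unittest
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Task:
    """A hypothetical task description with pre- and post-state checks."""
    name: str
    body: Callable[[Dict[str, Any]], Dict[str, Any]]
    pre: Callable[[Dict[str, Any]], bool]
    post: Callable[[Dict[str, Any], Dict[str, Any]], bool]


def make_test_case(task, sample_inputs):
    """Build a unittest.TestCase class that exercises one task."""

    class TaskTest(unittest.TestCase):
        def test_task(self):
            for inputs in sample_inputs:
                with self.subTest(inputs=inputs):
                    self.assertTrue(task.pre(inputs))
                    snapshot = copy.deepcopy(inputs)
                    outputs = task.body(inputs)
                    # Self-containment: the task must not modify its inputs.
                    self.assertEqual(inputs, snapshot)
                    self.assertTrue(task.post(inputs, outputs))

    TaskTest.__name__ = f"Test_{task.name}"
    return TaskTest


def scale_body(inputs):
    # Multiply every element of x by the scalar a.
    return {"y": [inputs["a"] * v for v in inputs["x"]]}


scale = Task(
    name="scale",
    body=scale_body,
    pre=lambda i: isinstance(i["a"], (int, float)) and len(i["x"]) > 0,
    post=lambda i, o: len(o["y"]) == len(i["x"]),
)
samples = [{"a": 2, "x": [1, 2, 3]}, {"a": 0.5, "x": [4.0]}]
ScaleTest = make_test_case(scale, samples)
```

The generated `ScaleTest` class can be collected by any standard `unittest` runner. The only extra work for the user is supplying sample inputs, since the pre- and post-state checks are already part of the task definition.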

The project will involve choosing an existing AMT programming model and runtime system (e.g. Charm++ or Legion) and developing a toolchain to support unit testing for it. Python is a promising language for implementing the toolchain because of its ubiquity and open nature.

 

Requirements

An Honours degree in computer science or equivalent. Some background in high performance computing and software engineering is desirable. Some tenacity and initiative in maintaining long-distance collaboration will be required.
 

Gain

The project offers the opportunity to work with an international team, including researchers at Sandia National Laboratories who are developing state-of-the-art AMT runtime systems. It will also give the opportunity to interact with scientists working with large-scale applications. There will also be unusual opportunities to travel, including internships at Sandia National Laboratories (California). A top-up scholarship will be available (to be confirmed).
 

Keywords

High performance computing, supercomputing, fault-tolerance, DAG-based programming models, asynchronous many-task programming models, software engineering.

Updated:  1 November 2018/Responsible Officer:  Dean, CECS/Page Contact:  CECS Marketing