Predicting Scientific Application Performance on Ultra-large Scale Supercomputers

Description

Modern large-scale supercomputers consist of a large number of multicore processor nodes connected by a complex high-performance network. Efficiently and accurately predicting application performance on supercomputers that are not yet available, or that are larger or have different networks than existing machines, is an extremely important open problem. SST/macro is an open-source simulation tool developed at Sandia National Laboratories that can address this problem. The approach is to 'skeletonize' an application: the purely computational sections are replaced with code that simply informs the simulator how long each section would have taken, while the intervening communication sections (typically expressed as calls to the MPI message-passing library) are interpreted by the simulator, which uses a model of the underlying network to estimate how long each call would have taken. In this way, even a very large-scale computation running over many cores can be efficiently simulated on a desktop. Two issues, however, arise: how to develop the skeleton efficiently, and how to determine whether the simulator's prediction is accurate.
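As an illustration, below is a minimal sketch of a skeletonized halo-exchange step. The MPI calls are retained for the simulator to interpret against its network model, while the stencil computation is replaced by a call reporting its estimated duration. The hook sstmac_compute and the per-point cost are assumptions for illustration, not necessarily the exact SST/macro API.

    #include <mpi.h>

    // Hypothetical simulator hook (name for illustration only): tells the
    // simulator how long the elided computation would have taken.
    extern "C" void sstmac_compute(double seconds);

    void skeleton_halo_step(double* local, int n, int left, int right,
                            MPI_Comm comm) {
      MPI_Request reqs[2];
      double halo;
      // Communication is kept verbatim: the simulator interprets these
      // MPI calls using its network model to estimate their cost.
      MPI_Irecv(&halo, 1, MPI_DOUBLE, left, 0, comm, &reqs[0]);
      MPI_Isend(&local[n - 1], 1, MPI_DOUBLE, right, 0, comm, &reqs[1]);
      MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

      // The original stencil update is removed; instead the skeleton
      // reports an estimated duration, e.g. n grid points times an assumed
      // per-point cost obtained by profiling the real application.
      const double seconds_per_point = 5.0e-9;  // assumed calibration value
      sstmac_compute(n * seconds_per_point);
    }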

Goals

The goal of this project is to develop a skeleton for some non-trivial applications of interest, including at the least an advection application with algorithm-based fault tolerance (ABFT) using the sparse grids technique. Issues in the development of the skeleton(s) will be noted, and the accuracy of the SST/macro prediction will be analyzed. This will involve validation, i.e. comparison of the predicted performance with the measured performance of the original application(s) on a suitable supercomputer; a calibration sketch supporting this is given after this paragraph. Extensions of this project include improving infrastructure to assist the validation process, for example by integrating the MPI-based performance monitoring tool IPM. At the Masters by Research and PhD level, the project would involve developing systematic, mechanically assisted ways of deriving skeletons, which is essential if the approach is to be used for real-world applications. A second direction would be to investigate systematically deriving accurate network models.
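One simple way to calibrate the compute estimates used in a skeleton, and later to help validate the simulator's predictions, is to time the original kernel on the target machine. Below is a minimal sketch; the toy kernel and output format are assumptions for illustration, not the project's actual application.

    #include <mpi.h>
    #include <cstdio>

    // Time the real compute kernel so its measured per-point cost can be
    // fed into the skeleton and later compared against the simulator's
    // prediction. The kernel below is a placeholder.
    double seconds_per_point(double* u, int n, int iters) {
      double t0 = MPI_Wtime();
      for (int it = 0; it < iters; ++it)
        for (int i = 1; i < n - 1; ++i)
          u[i] = 0.5 * (u[i - 1] + u[i + 1]);  // toy advection-like update
      return (MPI_Wtime() - t0) / (static_cast<double>(iters) * n);
    }

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      const int n = 1 << 20;
      double* u = new double[n]();  // zero-initialized work array
      std::printf("seconds per grid point: %.3e\n",
                  seconds_per_point(u, n, 10));
      delete[] u;
      MPI_Finalize();
      return 0;
    }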

Requirements

Some experience in high performance computing is highly desirable, especially for single-semester projects. Experience and competency in working with C++ software are also expected.

Background Literature

Curtis L. Janssen, Helgi Adalsteinsson, Scott Cranford, Joseph P. Kenny, Ali Pinar, David A. Evensky, and Jackson Mayo. A simulator for large-scale parallel architectures. International Journal of Parallel and Distributed Systems, 1(2):57-73, 2010.

Andrew Kerr, Eric Anger, Gilbert Hendry, and Sudhakar Yalamanchili. Eiger: A framework for the automated synthesis of statistical performance models.

Gain

This project intersects with ANU's collaboration with Sandia National Laboratories.
