Refactor NCAR applications for performance portability

CISL continues to make strategic investments in the computational efficiency of applications that are critical to NCAR’s fundamental scientific mission. In FY2019, these investments focused on improving the efficiency of the Community Earth System Model and the Data Assimilation Research Testbed on NCAR’s Cheyenne supercomputer and other CPU-based architectures, and on porting and optimizing the Model for Prediction Across Scales-Atmosphere and the MURaM solar physics model on GPU architectures. These efforts will increase the amount of NCAR science that can be done on existing infrastructure and enable potentially significant increases in what can be done on future computing hardware. This work was enabled in FY2019 largely by NSF Core funding, with external funding as noted below.

Community Earth System Model optimizations

The sustained, multi-year effort to optimize the Community Earth System Model (CESM) is a collaboration between CISL and the CESM Software Engineering Group in the Climate and Global Dynamics (CGD) Laboratory. It aims to enable modeling-based science through more efficient use of computational resources. This effort continued to bear fruit in FY2019: the optimized version of CESM2 was deployed as the workhorse for the massive Coupled Model Intercomparison Project computational campaign. CISL estimates that this effort has reduced the total cost of a CESM2 run by approximately 13%, saving tens of millions of CPU-hours. The reduction in cost varies across scientific configurations. Optimizations to the Conservative Semi-Lagrangian Multi-tracer transport scheme, for example, reduce the cost of the Whole Atmosphere Community Climate Model with thermosphere and ionosphere extensions by 30-40%. These impacts are gauged by comparing measured performance with an estimate of what it would have been had the optimization effort not been undertaken.

In FY2019, CISL focused its optimization efforts on two physics parameterizations: the Cloud Layers Unified by Binormals (CLUBB) parameterization and version 2 of the Morrison-Gettelman microphysics (MG2). CLUBB is the single most expensive physics package in the Community Atmosphere Model (CAM) component of CESM. CISL provided a detailed performance analysis of CLUBB to its developers at the University of Wisconsin, which enabled them to reduce its total cost by approximately 40%. In the case of MG2, CISL researchers developed a long-vector-length version and used it to investigate reduced-precision arithmetic in CAM, achieved a 20x speedup on the NEC vector engine, and developed a directive-based OpenACC approach for execution on graphics processing units (GPUs).
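
The long-vector and GPU versions of MG2 both rely on the same general restructuring: collapsing independent column and level work into one long loop that can fill a vector unit or a GPU. The following is a minimal illustrative sketch of that pattern in C++ with an OpenACC directive, not MG2 itself (MG2 is Fortran); the names (ncol, nlev, qc, tend) and the placeholder physics are hypothetical.

    // Illustrative sketch only: a microphysics-style update restructured so the
    // independent (column, level) work becomes one long vector loop. Names and
    // physics are hypothetical, not taken from MG2.
    #include <vector>

    void microphysics_update(int ncol, int nlev,
                             const std::vector<double>& qc,  // cloud water, flattened [ncol*nlev]
                             std::vector<double>& tend,      // tendency, flattened [ncol*nlev]
                             double rate, double dt)
    {
        const int n = ncol * nlev;           // single long vector dimension
        const double* q = qc.data();
        double* t = tend.data();

        // One flat loop vectorizes well on long-vector hardware and maps directly
        // to a GPU when compiled with an OpenACC-aware compiler; without OpenACC
        // the directive is ignored and the CPU loop is unchanged.
        #pragma acc parallel loop copyin(q[0:n]) copyout(t[0:n])
        for (int i = 0; i < n; ++i) {
            t[i] = -rate * q[i] * dt;        // placeholder physics; real MG2 is far more involved
        }
    }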

Data Assimilation Research Testbed optimizations

In FY2019, staff in CISL’s Application Scalability and Performance group continued to work on a more scalable version of the Data Assimilation Research Testbed (DART) assimilation loop, which groups independent observations using a graph-coloring algorithm. The original implementation required a global synchronization for each observation in the form of an MPI broadcast to all ranks, whereas the new algorithm reduces the number of global synchronizations required to the number of colors. For a 192k-observation data set, this reduced the total number of synchronizations needed by several orders of magnitude. It reduced communication time in DART by an average of 90% on 4 to 384 Cheyenne nodes. The graph-coloring modification is promising for configurations with a large number of observations on systems without high-speed interconnects.
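
The grouping idea can be sketched in a few lines of C++ with MPI; this is an illustration of the pattern, not DART’s actual Fortran implementation, and the names (obs_color, obs_value, num_colors) are hypothetical. The point is that all observations sharing a color are exchanged with a single collective, so the number of global synchronizations scales with the number of colors rather than the number of observations.

    // Minimal sketch: observations that are mutually independent share a
    // "color", and values for a whole color are exchanged with one collective
    // instead of one broadcast per observation.
    #include <mpi.h>
    #include <cstddef>
    #include <vector>

    void assimilate(const std::vector<int>& obs_color,   // color assigned to each observation
                    std::vector<double>& obs_value,      // per-observation values to share
                    int num_colors, MPI_Comm comm)
    {
        for (int c = 0; c < num_colors; ++c) {
            // Pack the observations belonging to this color.
            std::vector<double> batch;
            std::vector<std::size_t> idx;
            for (std::size_t i = 0; i < obs_color.size(); ++i) {
                if (obs_color[i] == c) { idx.push_back(i); batch.push_back(obs_value[i]); }
            }
            // One global synchronization per color (rank 0 owns the batch here),
            // instead of one MPI_Bcast per observation as in the original loop.
            MPI_Bcast(batch.data(), static_cast<int>(batch.size()), MPI_DOUBLE, 0, comm);
            for (std::size_t k = 0; k < idx.size(); ++k) obs_value[idx[k]] = batch[k];
            // ...each rank would then update its state with this batch of observations...
        }
    }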

GPU optimizations

CISL engineers in the Special Technical Projects group have been working with Mesoscale and Microscale Meteorology (MMM) Laboratory scientists and engineers – along with undergraduate and graduate students and faculty from the University of Delaware and the University of Wyoming – to create CPU-GPU portable versions of two important applications: the Model for Prediction Across Scales-Atmosphere (MPAS-A) and the MPS/University of Chicago Radiative MHD (MURaM) model, a flagship solar physics model used by the High Altitude Observatory (HAO). Both have been the objects of intensive refactoring. The chosen porting method for both models is the OpenACC directive-based, open-standard programming model. Both projects have met significant milestones and are described below.

MPAS-A

The NCAR-led application team completed an initial version of GPU-enabled MPAS-A. It is undergoing pre-production testing by The Weather Company (TWC), an IBM subsidiary, for use as the forecast model in its Global High-Resolution Atmospheric Forecast system. This work was performed under a Joint Development Agreement (JDA) between TWC, IBM, and UCAR. CISL received $100,000 in funding through this JDA in FY2019.

The JDA project had three principal goals: achieve good performance on GPUs without compromising CPU performance; retain the readability and maintainability of the source; and keep the resulting optimized CPU/GPU code available to the scientific community as open source. These goals were met. The MPAS-A team found that a single, current-generation NVIDIA Volta GPU running the moist dynamical core achieves approximately three times the performance of a dual-socket, 18-core Intel Xeon Broadwell node (see the 10 km CPU and GPU results in Figure 1). For maintainability and readability, the use of OpenACC directives allowed the project to avoid duplicating source code or toggling architecture-specific code in and out with “ifdef” statements. In the end, after porting about 54,000 lines of MPAS-A source code, the source had grown by only 5-10%.
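
The single-source property can be illustrated with a short, hypothetical C++ loop in the spirit of the port (MPAS-A itself is Fortran, and the names nCells, nVertLevels, theta, and tend are invented for this sketch). The same loop nest serves both builds: compiled with OpenACC enabled it offloads to the GPU, while compiled without it the directive is ignored and the untouched CPU loop remains, so no “ifdef”-guarded duplicate of the kernel is needed.

    // Illustrative only: one source path for CPU and GPU builds.
    void advance_theta(int nCells, int nVertLevels,
                       const double* theta, double* tend, double dt)
    {
        #pragma acc parallel loop collapse(2) \
            copyin(theta[0:nCells*nVertLevels]) copyout(tend[0:nCells*nVertLevels])
        for (int iCell = 0; iCell < nCells; ++iCell) {
            for (int k = 0; k < nVertLevels; ++k) {
                const int idx = iCell * nVertLevels + k;
                tend[idx] = theta[idx] * dt;   // placeholder update, not MPAS-A physics
            }
        }
    }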

The MPAS-A project also implemented physics-based task parallelism to further accelerate the model on heterogeneous CPU-GPU architectures. The strategy leverages an approach first proposed and evaluated by scientists at the Geophysical Fluid Dynamics Laboratory, in which the radiative transport (RT) calculations run concurrently with the dynamics and the other physics. It has the dual advantages of reducing the amount of code to port to GPUs and speeding up model throughput. The CPU-based RT domains coincide with the GPU domains, ensuring that communication between the GPU-based portion of the model and the RT scheme stays on the node.
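
A minimal sketch of the overlap idea follows, written in C++ with OpenACC and hypothetical routines and array names; it is not the MPAS-A implementation. The GPU work is launched asynchronously, the CPU computes radiation for the coincident node-local domain while the GPU runs, and the two meet at a single wait point, so no off-node traffic is added.

    // Illustrative sketch of CPU-GPU task parallelism with OpenACC async queues.
    void model_timestep(int n, double* state, const double* rt_input, double* rt_output)
    {
        // Dynamics and fast physics: queued on the GPU without blocking the host.
        #pragma acc parallel loop async(1) copy(state[0:n])
        for (int i = 0; i < n; ++i) {
            state[i] += 1.0e-3 * state[i];       // placeholder dynamics update
        }

        // Radiative transfer for the coincident CPU domain runs while the GPU works.
        for (int i = 0; i < n; ++i) {
            rt_output[i] = 0.5 * rt_input[i];    // placeholder RT calculation
        }

        // Join: block until the asynchronous GPU work (and its transfers) has
        // completed before combining radiation tendencies with the dynamics state.
        #pragma acc wait(1)
    }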

The MPAS-A performance data shown in Figure 1 were collected on the Summit supercomputer at the Oak Ridge Leadership Computing Facility using the version 19 PGI compiler. Each of Summit’s AC922 nodes contains two IBM POWER9 processors and six NVIDIA Volta GPUs, each GPU with 16 GB of onboard memory. The reference performance was obtained with the Intel 19 compiler on the Cheyenne supercomputer’s dual-socket, 18-core Intel Xeon Broadwell nodes. The MPAS-A test case used single precision, 56 vertical levels, and 6 moist tracers.

Figure 1: MPAS-A weak scaling (left) on an IBM AC922 cluster (the Summit system at Oak Ridge National Laboratory) with 6 NVIDIA V100 Volta GPUs per node; the horizontal patch size is held at 40,960 horizontal points per node (diamonds) and 81,920 points per node (squares). Log-log strong-scaling curves (right) for MPAS-A on CPU- and GPU-based architectures at single precision with 56 levels: NVIDIA V100 GPUs (10 km, orange; 5 km, green; 3 km, brown) and nodes of the Cheyenne supercomputer at NCAR. The largest configuration used in these experiments was 4,200 GPUs running the 3 km case on Summit.

MURaM

MURaM is a magnetohydrodynamic (MHD) code designed to perform realistic simulations of the solar atmosphere. To accomplish this, MURaM includes the physical effects of non-gray radiative transfer, partial ionization, and full compressibility, along with open boundary conditions. MURaM performance improvements were required to meet HAO’s science goals. These science drivers included increased resolution to match the improved solar observational capabilities of the Daniel K. Inouye Solar Telescope, which is expected to come online in 2020; scientists’ desire to achieve a forecast capability; and simulations of the full solar atmosphere, which require significant increases in computing power, especially for the transitional region known as the chromosphere.

The effort to refactor the MURaM model for GPUs using OpenACC pragmas made substantial progress during the fiscal year. The project team went through several profiling-optimization-reprofiling iterations on the code. The major FY2019 accomplishments in porting and optimizing MURaM on GPUs included: 

  1. Accelerating the most computationally intensive kernel, radiative transport (RTS), on a GPU device under the OpenACC framework, achieving a 70x speedup relative to one Intel Xeon Gold 6140 Skylake core. 

  2. Optimizing the data structure and workflow across the whole code (see the sketch following this list). 

  3. Implementing a new, highly parallel algorithm for the RTS kernel.

  4. Introducing the latest PGI compiler features, such as automatic comparison of GPU and CPU kernel executions to locate errors.
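
The data-structure and workflow optimization in item 2 follows a common OpenACC pattern: keep the fields resident on the GPU across the per-step kernels so they are not copied to and from the host between them. The sketch below illustrates that pattern in C++; it is not MURaM source, and the names (rho, mom, eng, nsteps) and placeholder updates are hypothetical.

    // Illustrative data-residency sketch: copy once, compute many steps on the
    // device, copy back once.
    void run_simulation(int n, int nsteps, double* rho, double* mom, double* eng)
    {
        #pragma acc data copy(rho[0:n], mom[0:n], eng[0:n])
        {
            for (int step = 0; step < nsteps; ++step) {
                #pragma acc parallel loop present(rho[0:n], mom[0:n])
                for (int i = 0; i < n; ++i) mom[i] += 1.0e-3 * rho[i];   // placeholder MHD-style update

                #pragma acc parallel loop present(mom[0:n], eng[0:n])
                for (int i = 0; i < n; ++i) eng[i] += 1.0e-3 * mom[i];   // placeholder TVD/RTS-style update
            }
        }   // fields return to the host only here
    }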

The result was a significant speedup of the major code components. Comparing one NVIDIA V100 GPU with one core of an Intel Xeon Skylake processor, the RTS kernel achieved a 70x speedup; the MHD kernel was 54x faster; and the total variation diminishing (TVD) kernel reached 35x the speed of the reference CPU core. These three kernels collectively consume more than 55% of the total wall time. The MURaM refactoring was supported by NSF Core funds and $25,000 of NASA funding.