Refactor NCAR applications for performance portability

CISL continues to make strategic investments in the computational efficiency of applications that are critical to NCAR’s fundamental scientific mission. In FY2018, these investments focused on improving the efficiency of the Community Earth System Model (CESM) and the Data Assimilation Research Testbed (DART) on Cheyenne and other CPU-based architectures, as well as on porting and optimizing the Model for Prediction Across Scales - Atmosphere (MPAS-A) for GPU architectures. These efforts will increase the amount of NCAR science that can be done on existing infrastructure and potentially enable significant increases in what can be done on future computing hardware. This work was enabled in FY2018 by core National Science Foundation funding, a $140,000 gift from Intel Corporation through the Intel Parallel Computing Center for Weather and Climate Simulations, and $75,000 from NVIDIA Corp. under a Weather and Climate Alliance subcontract.

CESM optimizations

A sustained effort to reduce CESM’s execution time has been under way for a number of years. The overall impact of this optimization effort can be gauged by measuring the current CESM2 execution time in detail and then estimating what the cost would have been had the optimizations not been undertaken. CISL estimates this sustained effort has reduced the total cost of running a Coupled Model Intercomparison Project configuration of CESM2, which was released in FY2018, by approximately 13%. The reduction varies across scientific configurations: the cost of a low-resolution Whole Atmosphere Community Climate Model chemistry configuration was reduced by 20%, while a high-resolution, IPCC-class atmosphere configuration was reduced by 25%. The most computationally expensive CESM2 configuration, which couples an ultra-high-resolution, eddy-permitting ocean model to a high-resolution atmosphere model, was reduced in cost by 45%. Based on these improvements and the total cost of Cheyenne, the optimization effort is estimated to yield $3.7 million to $12.6 million of additional scientific throughput from Cheyenne over four years.
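As a rough illustration of how such percentage reductions translate into a dollar figure, the sketch below multiplies a machine’s total cost by the share of its workload a configuration consumes and by the measured cost reduction. The machine_cost and workload_share values are hypothetical placeholders, not the inputs behind the $3.7 million to $12.6 million estimate.

    /* Back-of-the-envelope conversion of a percentage cost reduction into
       an equivalent dollar value of recovered throughput. All inputs are
       hypothetical placeholders for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double machine_cost   = 30.0e6; /* hypothetical 4-year system cost ($)   */
        double workload_share = 0.25;   /* hypothetical share of cycles in CESM2 */
        double cost_reduction = 0.13;   /* measured CMIP-configuration reduction */
        double recovered = machine_cost * workload_share * cost_reduction;
        printf("Equivalent recovered throughput: $%.2f million\n", recovered / 1e6);
        return 0;
    }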

Figure 1. Computational cost comparison for the ultra-high-resolution CESM1.3 configuration, which couples a 25-km atmosphere model to a 0.1-degree ocean and sea-ice model.

Figure 1 illustrates the computational cost of high-resolution CESM1.3, which couples 25-km atmosphere and land surface components to ultra-high-resolution 0.1-degree ocean and sea-ice models. The cost in node-hours per simulated year is nearly flat (total-opt) regardless of node count. The significant improvement over the original code comes from the boundary exchange module used by the CICE and POP models. The gains stem primarily from the realization that a particular message-passing buffer was 10 to 40 times larger than necessary on a subset of the MPI ranks. Optimized versions of the boundary exchange modules minimize the total size of these buffers, reducing the total execution time of CICE on 14,000 Cheyenne cores by approximately 56% and of POP on 7,000 cores by approximately 30%. Together, the boundary exchange optimizations reduce the simulation cost of ultra-high-resolution CESM1.3 by approximately 20%.
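The pattern behind this fix can be sketched in a few lines. The real CICE and POP boundary exchange modules are written in Fortran and are considerably more involved; the C fragment below, with illustrative names throughout, shows only the general idea of packing and sending exactly the halo points a rank owns rather than a buffer sized for the worst case across all ranks.

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch of a halo exchange with one neighbor; all names illustrative.
       The old pattern allocated (and moved) a buffer sized for the largest
       halo on any rank, so most ranks handled far more data than needed. */
    void halo_exchange(const double *field, const int *halo_idx, int n_halo,
                       int neighbor, int n_recv, double *recv, MPI_Comm comm)
    {
        /* Pack and send exactly the n_halo points this rank actually owns. */
        double *send = malloc(n_halo * sizeof(double));
        for (int i = 0; i < n_halo; i++)
            send[i] = field[halo_idx[i]];

        MPI_Request reqs[2];
        MPI_Irecv(recv, n_recv, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
        MPI_Isend(send, n_halo, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        free(send);
    }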

DART optimizations

CISL made significant improvements in FY2018 to the performance of DART, a computationally demanding application whose bottleneck varies by scientific configuration. Memory has been the bottleneck for large models such as a high-resolution POP ocean model. Distributing the model’s ensemble mean across all processors reduces memory use but increases off-node communication. CISL determined that both memory and communication could be reduced by instead distributing the complete ensemble mean across the processes of each compute node, so that every node holds a full, locally accessible copy. Initial results showed this reduces communication time by up to 90%, a significant reduction for large models.
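One standard way to realize a node-resident ensemble mean is an MPI-3 shared-memory window, sketched below; whether DART uses this exact mechanism is an assumption here, and the sizes and names are illustrative. Each rank allocates only its slice of the mean, yet every rank on the node can read the complete vector through plain loads, with no off-node traffic.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Communicator containing only the ranks that share this node. */
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        int nrank, nsize;
        MPI_Comm_rank(node, &nrank);
        MPI_Comm_size(node, &nsize);

        const MPI_Aint n = 1000000;                /* illustrative state length */
        MPI_Aint chunk = (n + nsize - 1) / nsize;  /* slice per node rank       */

        /* Each rank allocates its slice inside one shared window per node. */
        double *slice;
        MPI_Win win;
        MPI_Win_allocate_shared(chunk * sizeof(double), sizeof(double),
                                MPI_INFO_NULL, node, &slice, &win);

        MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
        for (MPI_Aint i = 0; i < chunk; i++)
            slice[i] = 0.0;        /* fill this rank's slice of the mean */
        MPI_Win_sync(win);         /* flush this rank's stores           */
        MPI_Barrier(node);         /* wait until every slice is written  */
        MPI_Win_sync(win);         /* observe the other ranks' stores    */

        /* Any rank can now read the whole mean through rank 0's base
           address; shared windows are contiguous across ranks by default. */
        MPI_Aint size; int disp; double *mean;
        MPI_Win_shared_query(win, 0, &size, &disp, &mean);
        if (nrank == 0)
            printf("mean[0] = %f on a node of %d ranks\n", mean[0], nsize);
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        MPI_Comm_free(&node);
        MPI_Finalize();
        return 0;
    }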

Also in FY2018, CISL developed an alternative DART assimilation loop that groups independent observations into colors using a graph-coloring algorithm. Where the previous loop required one global synchronization, a broadcast across all MPI ranks, per observation, the alternative reduces the number of global synchronizations from the number of observations to the number of colors. For a 192,000-observation dataset, the coloring reduced the total number of synchronizations by several orders of magnitude. Timings of the data assimilation loop for a 100-member ensemble on 384 nodes showed a nearly 2.5x speedup over the existing algorithm.
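A minimal greedy coloring illustrates the idea. The toy conflict graph and the heuristic below are simplified stand-ins for whatever DART actually builds; here two observations conflict when they cannot be assimilated in the same batch, for example because their localization regions overlap.

    #include <stdio.h>
    #include <stdbool.h>

    /* Greedy coloring over a toy conflict graph: conflict[i][j] = 1 when
       observations i and j cannot be assimilated in the same batch.
       This chain of conflicts is illustrative only. */
    #define NOBS 8
    static const int conflict[NOBS][NOBS] = {
        {0,1,0,0,0,0,0,0},
        {1,0,1,0,0,0,0,0},
        {0,1,0,1,0,0,0,0},
        {0,0,1,0,1,0,0,0},
        {0,0,0,1,0,1,0,0},
        {0,0,0,0,1,0,1,0},
        {0,0,0,0,0,1,0,1},
        {0,0,0,0,0,0,1,0},
    };

    int main(void)
    {
        int color[NOBS], ncolors = 0;
        for (int i = 0; i < NOBS; i++) {
            bool used[NOBS] = {false};
            for (int j = 0; j < i; j++)    /* colors taken by neighbors */
                if (conflict[i][j]) used[color[j]] = true;
            int c = 0;
            while (used[c]) c++;           /* smallest free color */
            color[i] = c;
            if (c + 1 > ncolors) ncolors = c + 1;
        }
        /* The assimilation loop then needs one global synchronization per
           color rather than one per observation. */
        printf("%d observations -> %d synchronization steps\n", NOBS, ncolors);
        return 0;
    }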

GPU-enabled Model for Prediction Across Scales - Atmosphere (MPAS-A)

CISL played a vital role in FY2018 in optimizing the performance of MPAS-A, a global general circulation model of the Earth’s atmosphere. As part of NCAR's Weather and Climate Alliance project, CISL helped develop a portable, multi-GPU MPAS implementation using OpenACC. CISL and other NCAR staff participated in the project, along with University of Wyoming faculty and students, collaborators at the Korea Institute of Science and Technology Information, IBM, a group of NVIDIA and PGI compiler developers, and The Weather Company.

Benchmarks of the OpenACC version of the MPAS-A dynamical core indicate that a single source code can be maintained for both Intel and NVIDIA architectures while retaining good performance on both. The benchmarks also showed that Volta GPUs can run the dynamical core up to 4x faster than fully subscribed, dual-socket, 36-core Broadwell nodes. Multi-GPU dynamical core runs show good scalability, with further optimization under way to improve performance. The MPAS-A physics has also been ported, with speedups varying from 0.9x for routines that account for less than 1% of total MPAS-A execution time to 3.2x for the more time-consuming routines.
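The single-source approach can be illustrated with a toy kernel; the routine below is a hypothetical stand-in, not MPAS-A code. With a directive-aware compiler (for example, pgcc -acc) the loop is offloaded to the GPU, while any other C compiler simply ignores the pragmas and runs the same source on the CPU.

    #include <stdio.h>

    /* Toy single-source kernel: offloaded under -acc, plain C otherwise. */
    void advance_scalar(int n, const double *restrict tend,
                        double *restrict state, double dt)
    {
        #pragma acc parallel loop present(tend[0:n], state[0:n])
        for (int i = 0; i < n; i++)
            state[i] += dt * tend[i];
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        static double state[N], tend[N];
        for (int i = 0; i < N; i++) { state[i] = 1.0; tend[i] = 0.5; }

        /* Keep fields resident on the device across timesteps so that
           host-device transfers do not dominate the run time. */
        #pragma acc data copy(state) copyin(tend)
        for (int step = 0; step < 100; step++)
            advance_scalar(N, tend, state, 0.001);

        printf("state[0] = %f\n", state[0]);  /* expect 1.05 */
        return 0;
    }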