Optimize model performance for current and future supercomputers

Throughout FY2017, CISL continued to expand its efforts to optimize NCAR codes, focusing primarily on NCAR’s flagship community models. This strategic optimization thrust is two-pronged. The first effort, SPOC, aims to optimize current model code bases for modern conventional supercomputers. The second, IPCC-WACS, is housed in TDD’s ASAP group and focuses on the future challenges of accelerator architectures. The SPOC effort is described below, and the IPCC-WACS effort is described in the section titled Explore many-core and accelerator-based architectures.

In recent years, the performance gains that can be extracted from supercomputers through software optimization have become at least as important as those coming from hardware improvements. Significant factors driving this trend include the stagnation or even reduction of single-thread execution speed, the aggressive introduction of vector/SIMD instruction sets, increased core counts per processor socket that require careful parallel programming to exploit, and the introduction of heterogeneous architectures composed of both conventional processors and accelerator coprocessors.

SPOC benefits supercomputing
This graphic shows the SPOC initiative’s targeted outcome of improving the performance of NCAR’s core atmospheric models by implementing modern software engineering methods and practices. Doing so reduces the computational demands on current HPC systems, preserves the models’ scientific integrity, and will help NCAR realize significant savings in future supercomputer procurements.

In FY2017, CISL’s Strategic Parallel and Optimization Computing (SPOC) initiative continued its NCAR-wide efforts to increase the performance and efficiency of NCAR’s community codes (CESM, WRF, and MPAS) on both Cheyenne and Yellowstone. Beyond the benefits on current supercomputer hardware, SPOC efforts targeted code optimizations that are expected to translate to performance benefits on future processor architectures. To supplement the support available within the Consulting Services Group (CSG), CISL identified additional resources for this work and embedded them directly with the model development teams. Key activities this year include:

  • A multidisciplinary CISL team led by CSG worked with the WRF Development Team to port the WRF Testing Framework (WTF) to Cheyenne. The SPOC effort identified and addressed numerous challenges and bottlenecks to improve WTF’s demanding build-test cycle. Areas addressed included internet transfer speeds from external host sites, efficient scheduling of WTF compile-test jobs, intra-node memory bandwidth, memory oversubscription, and tuning of Cheyenne’s batch job queue structure and internal job scheduling algorithms.

  • A CSG-CGD collaboration conducted a load-balancing study on a suite of standard CESM2 test cases. The goal of the investigation was to determine the optimal processor layout on Cheyenne for each of the models’ compset and grid configurations. Seven compsets were studied, and modest improvements over the default configurations were realized in several of them. The study concluded that the test cases provided by CGD were well balanced for Cheyenne and that no additional fine-tuning was required or recommended. Larger test cases will be examined in the coming year.

  • NCAR and SGI/HPE continued to collaborate through the Joint Center of Excellence, which focuses on application optimization and performance improvement activities. Through these efforts, NCAR and SGI/HPE will work to optimize the operation of the Cheyenne system; to port, tune, and optimize applications for the Cheyenne environment; and to prepare NCAR models and the Cheyenne hardware and software ecosystem for future and emerging HPC technologies.

  • Training has also been identified as a key contribution of the SPOC initiative toward building the relevant skills in the NCAR developer community. To that end, CISL hosted vendor-led training events: one by Intel on its analysis tools and compilers, and another by Allinea introducing its debugging and profiling tools.

The SPOC initiative is supported by NSF Core funds.