Optimize model performance for current and future supercomputers

In FY2016 CISL continued to expand its efforts to optimize NCAR codes, focusing first on NCAR’s flagship community models. This strategic optimization thrust is two-pronged: one effort (called SPOC) aims to optimize current model code bases for Yellowstone-like systems (i.e., conventional multi-core processors), and a second (called IPCC-WACS), housed in TDD’s ASAP group, focuses on the future challenges of the accelerator space. The SPOC effort is described below, and the IPCC-WACS effort is described in the section titled Explore many-core and accelerator-based architectures.

In recent years, the performance gains available from software optimization have become at least as important as those from hardware improvements. Significant factors driving this trend include the stagnation or even reduction of single-thread execution speed, the aggressive introduction of vector/SIMD instruction sets, the growing number of cores per processor socket (which requires careful parallel programming to use effectively), and the introduction of heterogeneous architectures that combine conventional processors with accelerator coprocessors.

Strategic Parallel and Optimization Computing (SPOC) initiative

In FY2016, CISL’s SPOC initiative continued its NCAR-wide efforts to increase the performance and efficiency of NCAR’s community codes (CESM, WRF, and MPAS) on Yellowstone. Beyond the benefits on current Yellowstone hardware, the SPOC efforts target code optimizations that are expected to carry over to future processor architectures. To supplement the support provided within the Consulting Services Group (CSG), CISL identified additional resources for this work and embedded them directly with the model development teams. Key activities this year include:

  • A CSG-CGD collaboration optimized CSLAM, a new transport scheme being implemented within the spectral element dynamical core of CAM (CAM-SE). CSLAM targets model runs with many tracers; prior to optimization, CAM-SE-CSLAM outperformed traditional CAM-SE only when a test configuration used more than 40 tracers. The SPOC effort focused on improving serial code performance by optimizing loop structures, improving memory access patterns, and reorganizing code for better vectorization. As a result, the team lowered the crossover point from the original 40 tracers to 18 tracers, surpassing the initial goal of 30 tracers.

  • With SPOC support, two former SIParCS students hired as student assistants for User Services have continued to pursue WRF optimizations. In one project, a student continued investigating the effects of domain decomposition, and especially of processor binding and hybrid parallelism (MPI and OpenMP), on WRF scaling. The other student developed performance-oriented code changes to WRF advection that demonstrated significant speed-ups in synthetic kernels. She is now working to apply those changes to the production code.

  • In a related effort tied to the procurement of the Cheyenne system, NCAR and SGI formed a Joint Center of Excellence in FY2016 focused on optimization and performance improvement. Through this center, NCAR and SGI will collaborate to optimize the operation of the Cheyenne system; to port, tune, and optimize applications for the Cheyenne environment; and to prepare NCAR models and the SGI hardware and software ecosystem for future and emerging HPC technologies.

  • Training has also been identified as a key contribution of the SPOC initiative toward building the relevant skills in the NCAR developer community. To that end, CISL hosted two vendor-led training events: one by Intel on its analysis tools and compilers, and another by Allinea introducing its debugging and profiling tools.
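
The "crossover point" cited for CAM-SE-CSLAM in the first activity above can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that each transport scheme's cost is a fixed overhead plus a constant cost per tracer. The function name and all cost values are hypothetical: the numbers are in arbitrary units, are not measurements, and were chosen only so that the arithmetic lands on the 40- and 18-tracer crossovers mentioned in the text.

```python
def crossover_tracers(fixed_a, per_tracer_a, fixed_b, per_tracer_b):
    """Tracer count at which scheme B becomes as cheap as scheme A.

    Assumed (hypothetical) cost model for each scheme:
        cost(n) = fixed + per_tracer * n
    Returns None if B's per-tracer cost is not lower than A's, in which
    case B can never catch up by adding tracers. Returns 0.0 if B is
    already at least as cheap with no tracers at all.
    """
    if per_tracer_b >= per_tracer_a:
        return None
    n = (fixed_b - fixed_a) / (per_tracer_a - per_tracer_b)
    return max(0.0, n)


# Illustrative values only (arbitrary units, not measured data).
# Scheme A: no fixed overhead, 1.0 per tracer.
# Scheme B before tuning: fixed overhead 20.0, 0.5 per tracer.
# Scheme B after tuning:  fixed overhead 9.0, 0.5 per tracer.
before = crossover_tracers(0.0, 1.0, 20.0, 0.5)
after = crossover_tracers(0.0, 1.0, 9.0, 0.5)
print(before, after)
```

Under this simple model, reducing the fixed overhead or the per-tracer cost of the second scheme moves the crossover to a smaller tracer count, which is the kind of shift the serial optimizations described above were aiming for.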

The SPOC initiative is supported by NSF Core funds.