Accelerating applications algorithmically

New high-performance computing algorithms are needed to produce increasingly ambitious simulations for the Earth System sciences. Because each doubling of spatial resolution requires a 16-fold increase in computational cost, increases in raw computing power alone will not be sufficient to address the grand-challenge problems we face. To meet this need, CISL is developing new numerical methods, solvers, and time-integration schemes for Earth System simulations. A complementary approach is to reduce the computational cost of a simulation by taking longer time steps or by using fewer grid points. CISL is reporting two highlights in this area for FY2015.

An efficient limiter for spectral-element and discontinuous Galerkin models

Efficient tracer transport schemes with monotonic or positivity-preserving properties are extremely important for climate models. For a practical climate model, hundreds of tracers (chemical species) need to be transported for a very long period of time to study the evolution of these tracers in the atmosphere.

WENO limiter effect
Simulation results with the DG transport scheme for a non-smooth tracer field (step cylinder located on a face of the cubed-sphere), after one revolution along the north-east direction of the cubed-sphere. The left panel shows results without using a limiter and the right panel shows the same but with the WENO limiter. The new WENO limiter successfully removes spurious oscillations and keeps the solution positive-definite.

To remove spurious oscillations in the numerical solution and make the solution physically recognizable, a numerical limiter is usually combined with a transport algorithm. The High-Order Method Modeling Environment (HOMME) developed at CISL employs the spectral-element (SE) and discontinuous Galerkin (DG) methods for the spatial discretization. The current default dynamical core in the Community Atmosphere Model (CAM) is based on the SE variant of HOMME (CAM-SE). For atmospheric models based on high-order methods such as SE or DG, tracer transport is very challenging because efficient limiters are typically unavailable. For this reason, finite-volume-based transport schemes are being considered for tracer transport in CAM-SE. However, this approach requires a uniform-resolution grid overlaid on the native SE grid, and it leads to complex flux coupling between the two grid systems.

It is difficult to design limiters that deliver both high-order accuracy and non-oscillatory properties, and the problem is even more challenging for the cubed-sphere geometry. In FY2015, a simple and efficient limiter based on the weighted essentially non-oscillatory (WENO) methodology was developed for DG or SE transport models on the cubed sphere. The uniform high-order accuracy of the resulting scheme is maintained because of the high-order nature of WENO procedures. Unlike the classic WENO limiter, for which the wide halo region may significantly impede parallel efficiency, the new limiter requires information only from the nearest neighboring elements and so does not degrade the inherently high parallel efficiency of the DG or SE scheme. A local bound-preserving (BP) filter can also be coupled to the scheme to guarantee the highly desirable positivity-preserving property for the numerical solution. When combined with the WENO limiter and the BP filter, the DG/SE transport scheme is mass conservative, high-order accurate, non-oscillatory, and positivity preserving for the model based on the cubed-sphere geometry. The new limiter can be implemented in the CAM-SE framework to support climate and atmospheric chemistry applications using the native DG/SE transport scheme.
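As a rough illustration of how a local bound-preserving filter can restore positivity without changing tracer mass, the sketch below rescales the nodal values of a single element about the element mean. The linear rescaling shown here, the function name `positivity_filter`, and the sample values and weights are assumptions made for illustration; the text above does not specify the filter actually used in HOMME.

```python
def positivity_filter(values, weights, floor=0.0):
    """Pull nodal values toward the element mean until none is below `floor`.

    values  : nodal tracer values inside one element (hypothetical data)
    weights : matching quadrature weights, assumed positive
    The weighted mean, and hence the element's tracer mass, is left unchanged.
    """
    total_w = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total_w
    lowest = min(values)
    if lowest >= floor:
        return list(values)              # already within bounds, nothing to do
    if mean <= floor:
        # Rescaling cannot help when the mean itself is not above the floor;
        # fall back to a flat profile at the mean (still mass conserving).
        return [mean] * len(values)
    theta = (mean - floor) / (mean - lowest)   # scaling factor, 0 < theta < 1
    return [mean + theta * (v - mean) for v in values]


# Sample element with one negative undershoot (illustrative numbers only).
nodal = [-0.1, 0.2, 0.9, 1.0]
quad_w = [1 / 6, 5 / 6, 5 / 6, 1 / 6]
filtered = positivity_filter(nodal, quad_w)
```

Because every value is moved along the line joining it to the mean, the shape of the profile inside the element is preserved up to the scaling factor, and elements that are already non-negative are returned untouched.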

The figure above shows numerical results for DG transport on the cubed sphere using a benchmark solid-body rotation test, in which the initial field should return to its original position without any deformation after a complete revolution (equivalent to a 12-day period around the planet). The initial scalar field is a non-smooth step cylinder with sharp edges. The unlimited solution (left panel) shows spurious oscillations, including negative undershoots. The WENO limiter combined with the BP filter completely removes the oscillations (right panel) and preserves the initial shape of the tracer field.

WENO limiter effect
Convergence plots of a fourth-order accurate DG transport scheme with (blue) and without (red) the WENO limiter for a smooth tracer field. A limiter is an essential component of the tracer transport model; it keeps the tracer monotonic (positivity preserving) throughout the numerical simulation. The new limiter does not degrade the accuracy of the scheme, conserves the tracer mass, and keeps the tracer free of spurious oscillations.

A remarkable feature of the new limiter is its selective application: the limiting operation is activated only where the solution is oscillatory. This not only saves computing time but also maintains the high-order accuracy of the smooth solution. The figure at right shows the convergence of the fourth-order DG scheme with and without the limiter for a smooth Gaussian tracer field.
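The idea of selective application can be sketched with a simple detector that flags an element only when its nodal values leave the range spanned by the means of the element and its two neighbors. This is a minimal one-dimensional sketch under stated assumptions: the detection criterion, the periodic grid, the unweighted nodal average used as the element mean, and the name `troubled_cells` are all illustrative and are not taken from the limiter described above.

```python
def troubled_cells(cell_nodes, tol=1e-12):
    """Return indices of elements whose nodal values overshoot local means.

    cell_nodes : list of lists, nodal values per element on a periodic 1-D grid
    tol        : slack added to the bounds before an element is flagged
    An unweighted nodal average stands in for the element mean here.
    """
    n = len(cell_nodes)
    means = [sum(c) / len(c) for c in cell_nodes]
    flagged = []
    for i, nodes in enumerate(cell_nodes):
        # Means of the left neighbor, the element itself, and the right neighbor.
        local = (means[i - 1], means[i], means[(i + 1) % n])
        lo, hi = min(local), max(local)
        if min(nodes) < lo - tol or max(nodes) > hi + tol:
            flagged.append(i)
    return flagged


# A step profile with an undershoot and an overshoot in the middle element.
cells = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, -0.1, 1.1],
         [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
to_limit = troubled_cells(cells)
```

Only the flagged elements would then be passed to the limiter, so smooth regions keep their original high-order representation. A criterion this strict also flags smooth extrema, so a practical detector would need to relax the bounds there, for example through a larger `tol`.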

This work advances CISL’s science frontier in algorithmic acceleration by developing new algorithms and computational approaches to produce simulations capable of addressing grand challenges. Specifically, it fulfills a strategic action item to accelerate applications algorithmically by developing new numerical methods, transport schemes with limiters, new solvers, and new time integration schemes.

This work is supported by NSF Core funding. In addition, CISL’s RSVP funding was made available to support a graduate student visitor on this project.


Implementing PDE solvers on multi-CPU-GPU systems based on Radial Basis Function - Finite Difference methods

In FY2015 we investigated a multi-CPU-GPU framework for PDE solvers based on the radial basis function-finite difference (RBF-FD) method. As a generalized finite differencing scheme, the RBF-FD method does not need an underlying mesh to structure the nodes. It offers high-order accurate approximation and requires O(N) operations per time step, where N is the total number of nodes. Although the meshless nature of RBF-FD, with its scattered node layout, can pose challenges for domain partitioning and communication among processors, the results were highly encouraging. The test case was the transport of a scalar quantity (the amount of a pollutant) in a strong vortex fluid flow. The first figure shows the solution at the point when the scalar quantity is most highly sheared by the flow, together with a domain partitioning among eight processors for a scattered node layout. (The PDE is solved at each of the nodes using the RBF-FD method.)
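To make the "generalized finite difference" idea concrete, the sketch below computes RBF-FD weights that approximate the x-derivative at one node from values at nearby scattered nodes. It is a minimal sketch under stated assumptions: the Gaussian basis, the shape parameter `eps`, the augmentation with linear polynomials, the sample stencil, and the function names are illustrative choices and are not taken from the solver described above.

```python
import math


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rbf_fd_weights_dx(center, nodes, eps=1.0):
    """Weights w_i such that du/dx(center) ~= sum(w_i * u(nodes[i]))."""
    n = len(nodes)

    def phi(p, q):                                   # Gaussian radial basis function
        return math.exp(-eps ** 2 * ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2))

    polys = [lambda p: 1.0, lambda p: p[0], lambda p: p[1]]
    size = n + len(polys)
    a = [[0.0] * size for _ in range(size)]
    b = [0.0] * size
    for i, xi in enumerate(nodes):
        for j, xj in enumerate(nodes):
            a[i][j] = phi(xi, xj)
        for k, p in enumerate(polys):
            a[i][n + k] = a[n + k][i] = p(xi)
        # d/dx of the Gaussian centered at xi, evaluated at the stencil center
        b[i] = -2.0 * eps ** 2 * (center[0] - xi[0]) * phi(center, xi)
    b[n + 1] = 1.0        # d/dx of the polynomial x is 1; for 1 and y it is 0
    return solve(a, b)[:n]


# Hypothetical nine-node scattered stencil around the origin.
stencil = [(0.0, 0.0), (0.5, 0.05), (-0.45, 0.1), (0.05, 0.5), (-0.1, -0.5),
           (0.45, 0.55), (-0.5, 0.45), (0.5, -0.45), (-0.55, -0.5)]
w = rbf_fd_weights_dx((0.0, 0.0), stencil)
dudx = sum(wi * (2 * p[0] + 3 * p[1] + 1) for wi, p in zip(w, stencil))
```

The weights depend only on node positions, so they can be computed once before time stepping. Each node then needs a fixed number of multiply-adds per step, which is consistent with the O(N) cost quoted above.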

PDE solution method
The image on the left shows a solution at a point when the scalar quantity (contour lines represent amount) is most highly sheared by the flow. The image on the right used the METIS software tool to partition a domain for a scattered-node layout among eight processors that will solve the PDE. Each partition is shown in a different color within the black boundary.
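The partitioning step can be pictured with a much simpler stand-in for METIS. The sketch below uses recursive coordinate bisection, a different and cruder technique, to split scattered nodes into eight groups of equal size. The function name `bisect_partition` and the random node set are hypothetical. Unlike METIS, which partitions a graph and tries to limit the connections cut between parts, this stand-in only balances node counts.

```python
import random


def bisect_partition(points, parts):
    """Split 2-D points into `parts` groups (a power of two) of near-equal size."""
    if parts == 1:
        return [list(points)]
    # Cut across the axis along which the points are most spread out.
    spans = [max(p[a] for p in points) - min(p[a] for p in points) for a in (0, 1)]
    axis = 0 if spans[0] >= spans[1] else 1
    ordered = sorted(points, key=lambda p: p[axis])
    half = len(ordered) // 2
    return (bisect_partition(ordered[:half], parts // 2)
            + bisect_partition(ordered[half:], parts // 2))


random.seed(0)                                   # reproducible sample layout
nodes = [(random.random(), random.random()) for _ in range(64)]
partitions = bisect_partition(nodes, 8)          # one group per processor
```

Each group would be assigned to one processor, and nodes whose stencils reach into a neighboring group are the ones that require communication at every time step.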

The study concluded that:

  • On a single GPU, the RBF-FD method could produce a 35-times speedup over a serial code on a single CPU.

  • On multi-CPU-GPU architectures, communication is a general bottleneck: the speedup is only 15 times with two CPUs and two GPUs, and 25 times with four CPUs and four GPUs. Not until eight CPUs and eight GPUs are used does the performance gain in computation overcome the communication costs, with a 45-times speedup. This is demonstrated in the figure below.

Parallel processing speedup
Speedup over the serial RBF-FD code (on one CPU) for transport of a scalar quantity in a strongly sheared flow, on one GPU and on multiple CPU/GPU platforms. The notation MPI_CUDA n means that n CPUs and n GPUs are used (e.g., MPI_CUDA 4 uses 4 CPUs and 4 GPUs). The runs were performed on the distributed-memory Caldera cluster.
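The reported speedups can be put on a common footing by dividing each multi-GPU figure by the single-GPU speedup and then by the number of GPUs. The short sketch below does this arithmetic using only the numbers quoted above; the function name `parallel_efficiency` and the choice of the single-GPU run as the baseline are assumptions made for illustration.

```python
def parallel_efficiency(speedups, baseline):
    """Map n -> (speedup relative to baseline, that ratio divided by n)."""
    return {n: (s / baseline, s / baseline / n) for n, s in speedups.items()}


# Speedups over the serial CPU code, as reported in the study.
reported = {2: 15.0, 4: 25.0, 8: 45.0}     # n CPU/GPU pairs -> speedup
single_gpu = 35.0

for n, (ratio, eff) in parallel_efficiency(reported, single_gpu).items():
    print(f"MPI_CUDA {n}: {ratio:.2f}x one GPU, {eff:.0%} per-GPU efficiency")
```

A ratio below 1 means the multi-GPU run is slower than a single GPU. With the reported numbers this is the case for two and four GPUs, and only the eight-GPU run exceeds 1.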

This work advances CISL’s scientific efforts to develop algorithms for computationally accelerating applications of NCAR-wide science. This development effort at NCAR is supported by NSF grant DMS-0934317.