Explore meshless numerical methods for modeling

Figure: RBF performance. Performance of the RBF-FD shallow water equation (SWE) solver, measured in GFLOPS, as a function of the number of Nvidia K40 GPUs (up to eight) connected over a PCIe bus in a Cirrascale GX-8 system.

Figure: RBF communication overhead. MPI communication overhead of the RBF-FD SWE solver on the same Cirrascale GX-8 multi-GPU system with eight Nvidia K40 GPUs. The solver scales well, spending less than 7% of its run time in MPI communication while approaching 1 TFLOPS of performance on a single multi-GPU system.

Radial basis functions (RBFs) offer a novel numerical approach for solving partial differential equations (PDEs) to high accuracy. As a meshless method, the RBF approach excels at problems that require geometric flexibility and local refinement to resolve small features. Further, RBFs require very little additional programming complexity when problems are extended to higher-dimensional spaces. In particular, the RBF-generated finite differences (RBF-FD) approach has made the RBF method computationally cost-effective in terms of scalability, memory, and runtime for solving systems of PDEs.
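
To make the idea concrete, the sketch below shows how RBF-FD weights are typically computed for a single stencil: the weights that approximate a differential operator at a stencil center are obtained by solving a small dense linear system built from the RBF evaluated at the stencil nodes. The example assumes a Gaussian RBF, a 2-D Laplacian, a fixed 9-node stencil, and an illustrative shape parameter, and it omits the polynomial augmentation often used in practice; all names and values are illustrative rather than taken from the solver described here.

    /* Sketch: RBF-FD weights for the 2-D Laplacian at a stencil center,
     * using a Gaussian RBF phi(r) = exp(-(eps*r)^2) and no polynomial terms.
     * The weights w solve A w = b, with A[i][j] = phi(||x_i - x_j||) and
     * b[i] = (Laplacian of phi)(||x_0 - x_i||), where x_0 is the center.
     */
    #include <math.h>
    #include <stdio.h>

    #define NS 9                      /* illustrative stencil size    */
    static const double EPS = 2.0;    /* illustrative shape parameter */

    static double phi(double r)     { return exp(-(EPS*r)*(EPS*r)); }
    static double lap_phi(double r) {              /* 2-D Laplacian of phi */
        double e2 = EPS*EPS;
        return (4.0*e2*e2*r*r - 4.0*e2) * exp(-e2*r*r);
    }

    /* Gaussian elimination with partial pivoting on the small dense system. */
    static void solve(double A[NS][NS], double b[NS], double w[NS]) {
        for (int k = 0; k < NS; ++k) {
            int p = k;
            for (int i = k + 1; i < NS; ++i)
                if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
            for (int j = 0; j < NS; ++j) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
            double tb = b[k]; b[k] = b[p]; b[p] = tb;
            for (int i = k + 1; i < NS; ++i) {
                double m = A[i][k] / A[k][k];
                for (int j = k; j < NS; ++j) A[i][j] -= m * A[k][j];
                b[i] -= m * b[k];
            }
        }
        for (int i = NS - 1; i >= 0; --i) {
            w[i] = b[i];
            for (int j = i + 1; j < NS; ++j) w[i] -= A[i][j] * w[j];
            w[i] /= A[i][i];
        }
    }

    int main(void) {
        /* Toy stencil: center node first, then eight neighbors. */
        double x[NS][2] = { {0,0}, {.1,0}, {-.1,0}, {0,.1}, {0,-.1},
                            {.1,.1}, {-.1,.1}, {.1,-.1}, {-.1,-.1} };
        double A[NS][NS], b[NS], w[NS];

        for (int i = 0; i < NS; ++i) {
            for (int j = 0; j < NS; ++j)
                A[i][j] = phi(hypot(x[i][0] - x[j][0], x[i][1] - x[j][1]));
            b[i] = lap_phi(hypot(x[0][0] - x[i][0], x[0][1] - x[i][1]));
        }
        solve(A, b, w);

        for (int i = 0; i < NS; ++i)        /* one row of the sparse operator */
            printf("w[%d] = % .6e\n", i, w[i]);
        return 0;
    }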

The localized and accurate nature of the RBF-FD method:

  • Leads to sparse matrices in which over 99% of the entries are zero.

  • Allows it to scale as O(N) per time step, with N being the total number of nodes (see the sketch following this list).

  • Makes it highly suitable for parallelization on accelerator-based computer architectures.
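
As noted in the list above, each node carries only a small set of precomputed stencil weights, so applying the RBF-FD differentiation operator touches a fixed number of neighbors per node. The sketch below illustrates that O(N) cost per application, assuming a constant stencil size and flat index/weight arrays; the array layout and names are illustrative, not those of the actual solver.

    /* Sketch: applying precomputed RBF-FD stencil weights as a sparse
     * differentiation operator.  Each of the N nodes has 'ns' neighbors,
     * so one application costs O(N * ns) = O(N) work per time step.
     */
    void apply_rbf_fd(int N, int ns,
                      const int    *restrict nbr,  /* [N*ns] neighbor indices */
                      const double *restrict wgt,  /* [N*ns] stencil weights  */
                      const double *restrict u,    /* [N]    field values     */
                      double       *restrict Lu)   /* [N]    result           */
    {
        for (int i = 0; i < N; ++i) {
            double s = 0.0;
            for (int k = 0; k < ns; ++k)
                s += wgt[i * ns + k] * u[nbr[i * ns + k]];
            Lu[i] = s;
        }
    }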

Development in FY2016 focused on parallelization for computer architectures that use hardware accelerators. Scalability, performance, and portability are essential when developing HPC algorithms, and many current standard atmospheric models fall short particularly on portability. In developing the RBF-FD solver for the shallow water equations (SWE) on the sphere, RBF development targeted today's three dominant HPC architectures: Intel Multicore, Intel Manycore, and Nvidia GPUs. To achieve portability across all three architectures, the directive-based OpenACC and OpenMP programming models were used for shared-memory parallelization, while MPI was used for distributed-memory parallelization to address the scalability of the solver. This allows a single-source implementation that requires only recompilation to run on practically any HPC system available today.
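
A minimal sketch of this single-source, directive-based strategy is shown below: the same loop carries an OpenACC directive for GPU builds and an OpenMP directive for multicore and manycore builds, selected at compile time via the standard _OPENACC macro. The routine and variable names are hypothetical and stand in for the solver's actual time-step update.

    /* Sketch: one loop body, two directive sets.  Compiling with OpenACC
     * enabled targets the GPU; otherwise the OpenMP directive provides
     * shared-memory threading on multicore/manycore CPUs.
     */
    void advance_field(int n_local, double dt,
                       const double *restrict rhs, double *restrict h)
    {
    #ifdef _OPENACC
        /* GPU build: data are assumed already resident on the device. */
        #pragma acc parallel loop present(rhs[0:n_local], h[0:n_local])
    #else
        /* Multicore/manycore build: shared-memory threading with OpenMP. */
        #pragma omp parallel for
    #endif
        for (int i = 0; i < n_local; ++i)
            h[i] += dt * rhs[i];   /* explicit time-step update of one field */
    }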

Excellent performance was demonstrated on the Intel Multicore and Manycore systems. Optimizing the MPI implementation for the GPU systems, however, proved to be the most challenging task. The figures above show the high performance (close to 1 TFLOPS) and the minimal MPI communication overhead (less than 7%) of the RBF-FD SWE solver on an eight-GPU system.
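
One common way to keep MPI overhead low on GPU systems is to hand device-resident halo buffers directly to a CUDA-aware MPI library from within an OpenACC host_data region, avoiding explicit device-to-host copies and allowing communication to overlap with interior computation. The sketch below illustrates that general technique under those assumptions; it is not the solver's actual exchange routine, and the buffer and rank names are hypothetical.

    /* Sketch: halo exchange with device buffers passed directly to MPI.
     * Assumes a CUDA-aware MPI library and OpenACC-managed device data.
     */
    #include <mpi.h>

    void exchange_halo(double *send_buf, double *recv_buf, int halo_count,
                       int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];

        /* Expose the device addresses of the halo buffers to MPI. */
        #pragma acc host_data use_device(send_buf, recv_buf)
        {
            MPI_Irecv(recv_buf, halo_count, MPI_DOUBLE, left,  0, comm, &req[0]);
            MPI_Isend(send_buf, halo_count, MPI_DOUBLE, right, 0, comm, &req[1]);
        }

        /* Interior stencils could be computed here, overlapping
         * communication with computation before the halo is needed. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    }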

This work advances CISL’s scientific efforts to develop scalable algorithms for atmospheric modeling on massively parallel and accelerator-based computer architectures. Development of numerical algorithms based on meshless methods for atmospheric modeling at NCAR is supported by NSF Core funds.