Foster research and technical collaborations

CISL has a robust set of ongoing partnerships and collaborations focused on the effective use of current and future high performance architectures for NCAR applications. These collaborations take several forms: membership in a regional HPC consortium; ongoing research and development projects with vendor, industry, and research laboratory partners; annual workshops, symposia, hackathons, and training events focused on emerging technologies and techniques; and regularly scheduled teleconferences on code optimization with vendor partners.

Rocky Mountain Advanced Computing Consortium

CISL’s participation in the Rocky Mountain Advanced Computing Consortium (RMACC) not only supports the development of regional high-performance cyberinfrastructure but also broadens and informs CISL’s collaborative opportunities, serves as an education and outreach vehicle, and helps increase the region’s collective knowledge of, and capabilities in, HPC technologies. As a long-standing member of RMACC, NCAR, through CISL’s participation, contributes to the success of the annual, two-day RMACC High Performance Computing Symposium, which attracts hundreds of faculty, students, and vendors to share research results and learn about HPC technologies and trends.

Vendor partnerships

CISL maintains a wide spectrum of strategic research and development partnerships, many of which include vendor participation and, in some cases, funding. These activities enable CISL to gain early access to emerging technologies, evaluate them, and carry out software development and optimization in advance so that critical software components, such as NCAR’s community models, will be able to exploit them. This work is done on an open-source basis that furthers the goals of the scientific communities at universities and NCAR. Examples of these activities include CISL’s role in Intel Corporation’s Parallel Computing Center for Weather and Climate Simulations (IPCC-WACS), which develops tools and techniques for getting more performance from Intel Xeon and Xeon Phi processors; a partnership with NVIDIA Corporation (and others) called the Weather and Climate Alliance (WACA), focused on enabling model portability to GPUs and on machine learning; a Joint Development Agreement (JDA) with IBM Corporation and its subsidiary, the Weather Company, to port MPAS to the IBM Power server architecture; and a Joint Center of Excellence (JCoE) with SGI, Inc. (now Hewlett Packard Enterprise Company) to explore ways to improve CISL’s production computing environment with HPE products.

Finally, CISL relies on the HPC Futures Lab facility and on CISL’s Supercomputing Services Group to acquire, deploy, and maintain the evaluation and development systems required for these collaborative research and development efforts.

Intel Parallel Computing Center for Weather and Climate Simulation (IPCC-WACS): In FY2017 CISL, and its collaborator the University of Colorado at Boulder (CU), continued their Intel-funded collaboration as the IPCC-WACS. This collaborative center promotes the discovery of new methods for optimizing the performance of weather and climate models on Intel Xeon and Xeon Phi hardware and accelerates the adoption of these optimizations back into key weather and climate community models. IPCC-WACS also has a student education and training component being led by CU.

The Intel gift has enabled CISL to develop the Kernel Generator (KGEN), a labor- and resource-saving tool for automatically extracting part of a large modeling code base and creating a kernel, or unit test, around it for optimization and subsequent verification. The KGEN tool is being evaluated and used by a steadily growing community of scientists and engineers. The number of institutions using or evaluating KGEN increased from 12 to 17 in FY2017, with new active engagements from diverse research institutions, including the Indian Institute of Technology in Mumbai, India; Lawrence Livermore National Laboratory; the University of Wyoming; and private companies such as Engility Corporation and ParaTools, Inc. Meanwhile, our university collaborator CU hosted Xeon Phi training during the 2017 RMACC HPC Symposium (see above) to train students to take advantage of Intel’s many-core architectures. The value of these IPCC-WACS impacts was recognized by Intel, which provided additional funding in early 2017 after reviewing the progress and accomplishments of the IPCC-WACS team.
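
The sketch below illustrates, in schematic form, the extracted-kernel pattern that KGEN automates: a small, standalone driver runs an isolated piece of the model against previously captured input state, times it, and verifies its output against a captured reference so optimized variants can be checked for correctness. This is an illustrative C program with synthetic data, not KGEN-generated code (KGEN operates on the Fortran model sources), and the kernel shown is hypothetical.

    /*
     * Schematic of the extracted-kernel pattern used for targeted optimization:
     * the kernel is isolated from the full model, fed previously captured input
     * state, timed, and its output checked against captured reference output.
     * (Illustrative only; state capture is simulated here with synthetic data.)
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <time.h>

    #define N 100000

    /* Stand-in for a small piece of model physics extracted for tuning. */
    static void kernel(const double *in, double *out, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = sqrt(in[i]) * 1.5 + 2.0;
    }

    int main(void) {
        double *in  = malloc(N * sizeof *in);
        double *out = malloc(N * sizeof *out);
        double *ref = malloc(N * sizeof *ref);

        /* In practice these arrays would be read from state files captured
           during a full model run; synthetic values keep this sketch runnable. */
        for (int i = 0; i < N; ++i) {
            in[i]  = (double)i;
            ref[i] = sqrt(in[i]) * 1.5 + 2.0;   /* captured reference answer */
        }

        clock_t t0 = clock();
        kernel(in, out, N);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* Verification step: optimized variants must stay within tolerance. */
        double maxerr = 0.0;
        for (int i = 0; i < N; ++i) {
            double e = fabs(out[i] - ref[i]);
            if (e > maxerr) maxerr = e;
        }
        printf("kernel time %.6f s, max abs error %.3e\n", elapsed, maxerr);

        free(in); free(out); free(ref);
        return 0;
    }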

Also in FY2017, the IPCC-WACS team completed optimization work focused on HOMME, the spectral element dynamical core in CESM. That work is now being backported to CESM Version 1.3 Build 17 to support important high-resolution climate experiments and is scheduled for integration into CESM in 2018. New work in 2017 by the CISL IPCC-WACS team included profiling and developing optimization strategies for the Data Assimilation Research Testbed (DART) to enhance its scalability and performance.

IPCC-WACS also proposed and developed a new focus in machine learning. Working with an ASP postdoc in the Research Applications Laboratory (RAL), the team began a collaborative project to evaluate Generative Adversarial Networks (GANs) for unsupervised learning. This project made heavy use of the Cirrascale GX8 system and Intel Xeon equipment in the CISL HPC Futures Lab (both described below).

The NCAR/CISL portion of the IPCC-WACS project is funded by a $140,000 corporate gift from Intel Corporation with matching by NSF Core funds.

Weather and Climate Alliance (WACA): The goal of WACA Phase 1 was to develop a performance-portable version of the Model for Prediction Across Scales - Atmosphere (MPAS-A) using directive-based OpenACC and OpenMP parallel programming techniques. By using directive systems like OpenACC and OpenMP, the team aims to demonstrate that a single, maintainable version of the MPAS code can run on both conventional microprocessors and GPUs. In FY2017, CISL successfully completed Phase 1 of its partnership with NVIDIA Corporation, the NVIDIA GPU Research Center (GRC) at the University of Wyoming, the Mesoscale and Microscale Meteorology (MMM) Laboratory at NCAR, and the Korea Institute of Science and Technology Information (KISTI).
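
As a minimal illustration of this directive-based, single-source approach, the C sketch below annotates one loop nest for both OpenACC (GPU offload) and OpenMP (CPU threading), so the same source compiles for either target. The loop and data are hypothetical stand-ins, not MPAS-A code, and the example assumes a compiler with OpenACC support (such as PGI) or OpenMP support.

    /*
     * Single-source portability sketch: one loop nest carries both an OpenACC
     * and an OpenMP annotation. Build with, e.g., "pgcc -acc" for GPUs or with
     * an OpenMP flag for multicore CPUs; the unused directive is simply absent
     * from that build. Hypothetical column/level loop, not actual model code.
     */
    #include <stdio.h>

    #define NCELLS 100000
    #define NLEV   32

    static double theta[NCELLS][NLEV];
    static double tend[NCELLS][NLEV];

    int main(void) {
        /* Initialize a synthetic model state. */
        for (int i = 0; i < NCELLS; ++i)
            for (int k = 0; k < NLEV; ++k)
                theta[i][k] = 300.0 + 0.01 * k;

    #ifdef _OPENACC
        /* GPU build: offload the loop nest with OpenACC. */
        #pragma acc parallel loop collapse(2) copyin(theta) copyout(tend)
    #else
        /* CPU build: thread the same loop nest with OpenMP. */
        #pragma omp parallel for collapse(2)
    #endif
        for (int i = 0; i < NCELLS; ++i)
            for (int k = 0; k < NLEV; ++k)
                tend[i][k] = -0.001 * (theta[i][k] - 300.0);

        printf("tend[0][%d] = %f\n", NLEV - 1, tend[0][NLEV - 1]);
        return 0;
    }

Guarding the two directive forms behind the _OPENACC preprocessor symbol keeps a single maintainable code path while letting each build select the appropriate annotation, which is the maintainability property the single-source goal above refers to.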

For NCAR, the WACA Phase 1 objective was to create a GPU-enabled MPAS-A dry dynamical core that (a) gave correct answers; (b) was refactored in a way understandable to NCAR’s MPAS-A core team in the MMM Laboratory; (c) scaled well to multiple GPUs; and (d) ran at least as fast as the original MPAS-A code on Intel Xeon and Xeon Phi processors. In FY2017, only the third of these objectives was not met, for technical reasons related to how memory is managed in the communication component of MPAS-A. This issue is now a principal focus area for the IBM Joint Development project described below.

Importantly, WACA Phase 1 marked the first time a significant portion of an NCAR community model was implemented as a GPU-based kernel. KISTI also made significant progress in developing the MPAS-A physics components. WACA Phase 2 goals include completing the port of the MPAS-A physics packages and then integrating this work into a complete GPU-based MPAS-A application.

WACA has enabled students at the University of Wyoming to gain valuable hands-on experience with HPC application software and optimization techniques. The WACA team, which includes staff from NVIDIA Corporation, the Portland Group, Inc. (PGI), KISTI, and CISL, came together for another OpenACC Hackathon in July 2017. This event provided hands-on experience with application porting using PGI OpenACC directives. The hackathon was attended by about a dozen students and several NCAR staff.

The NCAR and UW portions of WACA are partially funded by a $50,000 contract with NVIDIA Corporation, and the majority of the support is from NSF Core funds.

IBM Corporation/The Weather Company Joint Development Agreement: In FY2017 NCAR embarked on a new, one-year collaboration with IBM Corporation and the Weather Company (a subsidiary of IBM) to (a) integrate the components of MPAS-A that were ported to the GPU using OpenACC under WACA Phase 1 and Phase 2 (see above), and (b) port them to the IBM Power Server S822LC platform. This architecture, code-named “Minsky,” is a heterogeneous node consisting of two IBM POWER8 processors and four NVIDIA Tesla Pascal (P100) GPUs. Minsky is an attractive platform for porting and integrating MPAS-A for two reasons. First, the CPU and GPU components of Minsky are interconnected by NVLink, a proprietary communication technology with significantly higher bandwidth than PCIe Gen3, which allows tight coupling between the IBM POWER CPU and the GPU components of the node. Second, PGI’s unified memory technology, combined with the high-bandwidth NVLink fabric, presents a single logical memory space rather than the separate CPU and GPU memory spaces of other systems. This enables effective, error-free migration of model components from the CPU to the GPU while maintaining reasonable application performance throughout.
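
The C sketch below illustrates the porting style that a single logical memory space makes possible: an already-offloaded routine and a still-CPU-resident routine can operate on the same ordinary allocation without explicit host/device copies, leaving data movement to the runtime. This is a conceptual sketch under the assumption of a managed/unified-memory build (for example, PGI’s -ta=tesla:managed option); the routines, sizes, and arithmetic are hypothetical and are not MPAS-A code.

    /*
     * Incremental CPU-to-GPU migration under a single logical memory space.
     * Assumes a managed/unified-memory build (e.g., pgcc -acc -ta=tesla:managed),
     * so the same malloc'd array can be touched by a GPU-offloaded step and by
     * a step still running on the CPU without explicit data clauses or copies.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    /* Step already ported: runs on the GPU when compiled with OpenACC. */
    static void advance_gpu(double *u, int n) {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            u[i] = 0.5 * (u[i] + 1.0);
    }

    /* Step not yet ported: still runs on the host CPU. */
    static void diagnose_cpu(const double *u, int n, double *mean) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += u[i];
        *mean = s / n;
    }

    int main(void) {
        /* One ordinary allocation; under managed memory the runtime migrates
           pages between CPU and GPU as each step touches them. */
        double *u = malloc(N * sizeof *u);
        for (int i = 0; i < N; ++i) u[i] = 2.0;

        double mean = 0.0;
        for (int step = 0; step < 3; ++step) {
            advance_gpu(u, N);          /* device touches u */
            diagnose_cpu(u, N, &mean);  /* host touches the same u */
        }
        printf("mean after 3 steps: %f\n", mean);
        free(u);
        return 0;
    }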

The collaboration with IBM and the Weather Company effectively expands the team focused on MPAS-A (see project diagram) and gives the project the driving focus of an end user (the Weather Company). It is important to note that none of the modifications to MPAS are expected to be specific to the Minsky architecture or its successors, and the resulting open-source code, if the port successfully meets the criteria established under WACA (see above), is planned for distribution in the publicly released versions of MPAS-A.

The IBM/TWC JDA activity provides $200,000 of funding to CISL, and is also supported with cosponsored staff time from NSF Core funds.

Cirrascale GX8 platform: In 2016, CISL acquired a Cirrascale GX8 dense-GPU solution that puts O(10) teraflops in a 4U chassis with a scalable PCI-switch interconnect. This GX8 system has formed the basis of multiple research projects, including GPU-based machine learning algorithm research and the development of several GPU-enabled 2D and 3D atmospheric PDE solvers, such as MPAS-A’s finite volume approach (see WACA Phase 1 above), Discontinuous Galerkin (DG) methods based on 3D elements, and 2D Radial Basis Function Finite Difference (RBF-FD) methods. Funding for the Cirrascale GX8 came from NSF Core funds, jointly supplied by CISL’s Technology Development and Operations and Services Divisions.

NCAR-SGI Joint Center of Excellence: In 2016, Silicon Graphics’ successful proposal in response to the NWSC-2 RFP included the suggestion to create a “Joint Center of Excellence” (JCoE) between NCAR and SGI. In 2017, SGI was acquired by Hewlett Packard Enterprise Company; however, the JCoE has continued to pursue joint research and development activities and to provide a forum for identifying and discussing strategically important issues of mutual concern. The collaboration fostered by the JCoE partnership helps NCAR design future analysis systems that can process large data sets more efficiently, and the results in turn help HPE evaluate future system design concepts for the wider HPC marketplace. Working through the JCoE has thus enabled mutually beneficial outcomes for NCAR and HPE.

One significant project, which used the SGI UV 300 platform and flash-based NVMe storage to accelerate NCAR climate workflows, was completed in FY2017. The UV 300 is a large shared-memory architecture designed to handle large-scale, data-intensive analytics workflows. CISL’s Application Scalability and Performance Group worked with SGI engineers to evaluate climate workflow benchmarks on UV 300 systems containing NAND-based SSD storage. Using the SSD storage as a high-speed cache enabled these global climate data post-processing workflows to speed up by factors of three to five at both high (0.25-degree) and low (1-degree) resolutions.
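
The pattern behind this result can be sketched simply: hot input data is staged once from slower shared storage onto the fast SSD/NVMe cache, and the many repeated read passes typical of post-processing then hit the cached copy. The C sketch below is a minimal illustration of that staging step; the file paths are hypothetical, and it does not represent the actual NCAR workflow or benchmark code.

    /*
     * Minimal SSD-as-cache staging sketch: copy a dataset from shared storage
     * to a fast local NVMe/SSD scratch path once, then run repeated analysis
     * passes against the cached copy. Paths are hypothetical placeholders.
     */
    #include <stdio.h>

    /* Stage src to dst with a simple buffered copy. Returns 0 on success. */
    static int stage_file(const char *src, const char *dst) {
        FILE *in = fopen(src, "rb"), *out = fopen(dst, "wb");
        if (!in || !out) {
            if (in) fclose(in);
            if (out) fclose(out);
            return -1;
        }
        char buf[1 << 16];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
        fclose(in);
        fclose(out);
        return 0;
    }

    int main(void) {
        /* Hypothetical locations: shared file system vs. local SSD cache. */
        const char *shared = "/shared/climate/monthly_means.nc";
        const char *cached = "/nvme/cache/monthly_means.nc";

        if (stage_file(shared, cached) != 0) {
            fprintf(stderr, "staging failed; falling back to shared storage\n");
            cached = shared;
        }

        /* Subsequent post-processing passes read from the cached copy, so
           repeated reads hit fast local SSD instead of the shared file system. */
        printf("post-processing passes would now read %s\n", cached);
        return 0;
    }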

The JCoE is based on cosponsored staff time, which is provided on NCAR’s side from NSF Core funds.

HPC Futures Lab

CISL’s HPC Futures Lab (HPCFL) and CISL’s Supercomputing Services Group (SSG) play a critical role in supporting the technology evaluation activities required by computational science research and development. For example, the Cirrascale GX8 evaluation activity and the Intel Xeon Phi servers cited here are enabled entirely by support received through the HPCFL and SSG teams.

Funding

Support for HPCFL comes via NSF Core funds, jointly supplied by CISL’s Technology Development Division and Operations and Services Division.