Evaluate, deploy, and maintain best-in-class HPC services

CISL’s high-performance computing (HPC) environment includes petascale supercomputing resources, high-performance centralized file systems, disk- and tape-based long-term storage systems, and data analysis and visualization (DAV) resources. CISL’s efforts focus on providing robust, reliable, and secure HPC and data storage resources in a production research environment.

The efficient design, operation, and maintenance of NCAR’s data-centric HPC environment support the scientific productivity of its user community. CISL continuously monitors system workload, utilization, and performance, and balances resource allocation through priority-based, intelligent job scheduling, a job queue structure tuned to users’ needs, and single-job resource limits. Together, these measures enable high, efficient resource utilization and rapid job turnaround. CISL’s continued leadership in providing discipline-focused computing and data services is critical in supporting NCAR as a national center.
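
The interplay of queue priority and per-job limits can be pictured with a short sketch. The following Python fragment is illustrative only, not CISL’s production scheduler; the queue names, priority values, and per-job node cap are hypothetical stand-ins for the kinds of policies described above.

    import heapq
    from dataclasses import dataclass, field

    MAX_NODES_PER_JOB = 18  # hypothetical single-job resource limit
    QUEUE_PRIORITY = {"premium": 0, "regular": 1, "economy": 2}  # lower value runs first

    @dataclass(order=True)
    class Job:
        priority: int                     # queue-derived priority (compared first)
        submit_order: int                 # FIFO tie-breaker within a priority level
        name: str = field(compare=False)
        nodes: int = field(compare=False)

    class PriorityScheduler:
        """Toy priority queue: higher-priority queues drain first, FIFO within a queue."""
        def __init__(self):
            self._heap = []
            self._counter = 0

        def submit(self, name, queue, nodes):
            if nodes > MAX_NODES_PER_JOB:
                raise ValueError(f"{name}: {nodes} nodes exceeds the per-job limit")
            heapq.heappush(self._heap, Job(QUEUE_PRIORITY[queue], self._counter, name, nodes))
            self._counter += 1

        def next_job(self):
            return heapq.heappop(self._heap) if self._heap else None

    sched = PriorityScheduler()
    sched.submit("climate_run", "regular", 16)
    sched.submit("urgent_debug", "premium", 2)
    print(sched.next_job().name)  # urgent_debug runs first despite later submission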

Computing and analysis services

In FY2018, CISL continued operating the Cheyenne supercomputer, which provided significant value throughout the fiscal year, delivering more than 970 million core-hours while executing more than 11.7 million jobs. Cheyenne is an SGI ICE-XA system with 4,032 nodes, 145,152 Intel Xeon E5-2697v4 (Broadwell) processing cores, 315 TB of memory, and peak processing power of 5.34 petaflops (5.34 quadrillion floating-point operations per second). With a service lifetime spanning calendar years 2017 through 2021, Cheyenne’s design and configuration provide exceptional and balanced I/O and computational capacity for the data-intensive needs of NCAR’s user community.
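
The quoted peak rate is consistent with the processor specifications. Assuming the Broadwell base clock of 2.3 GHz and 16 double-precision floating-point operations per core per clock cycle (AVX2 with fused multiply-add), a back-of-the-envelope check gives:

    145,152 cores × 2.3 × 10^9 cycles/second × 16 flops/cycle ≈ 5.34 × 10^15 flops = 5.34 petaflops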

CISL deployed, tested, and accepted NCAR’s new DAV system, Casper, near the end of the fiscal year, and released it to users on October 3, 2018. The Casper system was acquired from PCPC Direct, Ltd., and will replace the nearly six-year-old IBM Geyser and Caldera clusters, which CISL will decommission at the end of the calendar year. The Casper cluster consists of 24 Supermicro nodes with Intel Xeon Gold (Skylake) processors. Four nodes are provisioned with dense configurations of NVIDIA Tesla V100 GPUs and large memory to support explorations in machine learning (ML) and deep learning (DL) in the atmospheric and related sciences.

Cheyenne’s predecessor HPC system, Yellowstone, was decommissioned at the end of December 2017 after serving the user community for more than five years. Yellowstone was one of the world’s first petascale systems, with peak performance rated at 1.54 petaflops. It was the first petascale system that CISL deployed and the first HPC system put into production at the NCAR-Wyoming Supercomputing Center (NWSC).

HPC resource availability and utilization profile

NCAR’s Cheyenne supercomputer and data analysis and visualization resources were well utilized during the fiscal year, despite issues that reduced Cheyenne’s average system availability below CISL norms. Investigation and resolution of an InfiniBand fabric stability issue caused an extended downtime in November 2017. A facility-wide NWSC power outage at the beginning of the calendar year and a four-day downtime in March to replace an electrical transformer further reduced availability.

Central storage environment and services

The central Globally Accessible Data Environment (GLADE) file system supports the efficient management, manipulation, and analysis of data, which is crucial for progress on the grand challenges of numerical modeling in Earth system science. Every user of CISL’s HPC environment benefits from its data-centric configuration.

CISL augmented GLADE’s capacity and capability in FY2018 by adding 0.45 PB of high-IOPS solid-state drives (SSDs). CISL also installed a new version of IBM Spectrum Scale (formerly GPFS) on the GLADE central storage system. This major update greatly enhances the file system’s ability to support both small and large files within a single parallel file system and provides optimal performance for both usage patterns. While parallel file systems generally have been tuned for large files, model runs in practice produce many small files alongside several large output files. The update provides near-equal file-system performance for all file types.
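
The mixed workload described above is easy to picture with a short sketch. This Python fragment is purely illustrative; the file counts and sizes are hypothetical values chosen to mimic a model run that writes many small ancillary files alongside a few large history files on the same file system.

    import os
    import tempfile
    import time

    def write_files(directory, count, size_bytes):
        """Write `count` files of `size_bytes` each and return the elapsed time."""
        payload = b"x" * size_bytes
        start = time.perf_counter()
        for i in range(count):
            path = os.path.join(directory, f"file_{size_bytes}_{i}.dat")
            with open(path, "wb") as f:
                f.write(payload)
        return time.perf_counter() - start

    with tempfile.TemporaryDirectory() as workdir:
        # Hypothetical model-run output: many small files plus a few large ones.
        small = write_files(workdir, count=1000, size_bytes=4 * 1024)        # 1,000 x 4 KiB
        large = write_files(workdir, count=4, size_bytes=256 * 1024 * 1024)  # 4 x 256 MiB
        print(f"1,000 small files: {small:.2f} s; 4 large files: {large:.2f} s")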

CISL also procured and deployed the new 20-PB, disk-based Campaign Storage system in FY2018 to accelerate user workflows over the course of projects lasting two to five years. Data may reside on the system for up to five years, after which it is purged. Allocations were made, and the system entered production in early July 2018. As part of the effort to centralize data-transfer services around Globus technologies, the Campaign Storage system is accessible only through the Globus interface. Networks are optimized for maximal data-transfer rates between the GLADE file system and the new Campaign Storage archive.
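
In practice, a GLADE-to-Campaign Storage transfer through Globus might look like the following sketch using the Globus Python SDK (globus-sdk). The client ID, endpoint UUIDs, and paths below are hypothetical placeholders, not NCAR’s actual identifiers.

    import globus_sdk

    CLIENT_ID = "your-native-app-client-id"          # hypothetical app registration
    GLADE_ENDPOINT = "uuid-of-glade-endpoint"        # hypothetical endpoint UUID
    CAMPAIGN_ENDPOINT = "uuid-of-campaign-endpoint"  # hypothetical endpoint UUID

    # Authenticate with a native-app OAuth2 flow and build a transfer client.
    auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
    auth_client.oauth2_start_flow()
    print("Log in at:", auth_client.oauth2_get_authorize_url())
    tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
    transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]
    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
    )

    # Describe and submit a recursive directory transfer from GLADE to Campaign Storage.
    tdata = globus_sdk.TransferData(
        tc, GLADE_ENDPOINT, CAMPAIGN_ENDPOINT, label="GLADE to Campaign Storage"
    )
    tdata.add_item("/glade/scratch/username/run_output/", "/run_output/", recursive=True)
    task = tc.submit_transfer(tdata)
    print("Submitted transfer task:", task["task_id"])

Because the same pattern applies to any pair of Globus endpoints, centralizing transfers on Globus gives users one mechanism regardless of which storage tier is involved.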

CISL also completed an evaluation of the new Linear Tape-Open (LTO) tape technology for its High Performance Storage System (HPSS) archive and concluded that it is a cost-effective solution for the deep archival tier of NCAR’s storage architecture. CISL is now deploying the new LTO technology, with full production planned for FY2019.

Evaluating future technologies

In FY2018, CISL began evaluating Globus as a common interface across all of the archival tiers of its high-performance storage architecture. The evaluation is scheduled for completion in early FY2019. As part of this evaluation, CISL and the University of Colorado’s PetaLibrary program are collaboratively testing the Globus interface to CISL’s HPSS system. Adding Globus as an HPSS interface will provide a single data-transfer mechanism across all of CISL’s storage systems.

Procurement activities and roadmap

In June 2018, CISL began the procurement effort for a follow-on to the Cheyenne system. As the third petascale system destined for the NWSC, the future system is referred to as “NWSC-3.” CISL and NCAR colleagues have begun defining the key science and technology drivers for NWSC-3. CISL anticipates releasing a formal Request for Proposal in early FY2020, with an award to be made late that year. The NWSC-3 system is planned for production availability in July 2021.

CISL also began initial project scoping, schedule coordination, and planning for the NWSC facility infrastructure upgrades required to support the NWSC-3 system. While supercomputer-specific facility enhancements must wait until after vendor selection, facility fit-up work that does not depend on a particular vendor’s solution can begin before the NWSC-3 award.