Evaluate, deploy, and maintain best-in-class HPC services

The high-performance computing (HPC) environment at NCAR’s Computational and Information Systems Laboratory (CISL) comprises the following integrated systems:

  • Cheyenne, a petascale supercomputing system

  • Casper, a data analysis, visualization, and machine learning resource that was made available to users at the beginning of FY2019

  • The Globally Accessible Data Environment (GLADE), a high-performance, centralized file system

  • Campaign Storage, a mid-performance, warm archival tier 

  • The High-Performance Storage System (HPSS), a cold archival tier

CISL provides robust, reliable, and secure HPC and data storage resources in a production research environment and supports that environment for the thousands of users and hundreds of projects spanning NCAR, universities, and the broader atmospheric science community. CISL continuously monitors system workload, utilization, and performance, and it balances resource allocation through priority-based job scheduling, a job queue structure tuned to users’ needs, and single-job resource limits – all of which enable efficient resource utilization and rapid job turnaround.
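
For a sense of how this queue structure appears to users, the sketch below submits a hypothetical batch job through PBS Pro, the workload manager used on Cheyenne. The queue name, project code, node/core counts, and walltime are illustrative placeholders, not CISL’s actual policy values.

    import subprocess
    import textwrap

    # Hypothetical job script: the queue name, project code, node/core counts, and
    # walltime below are placeholders, not CISL's actual policy values.
    job_script = textwrap.dedent("""\
        #!/bin/bash
        #PBS -N example_job
        #PBS -A PROJECT_CODE
        #PBS -q regular
        #PBS -l select=2:ncpus=36:mpiprocs=36
        #PBS -l walltime=02:00:00
        mpiexec ./model.exe
    """)

    # PBS Pro's qsub reads the job script from stdin when no script file is given.
    result = subprocess.run(["qsub"], input=job_script, text=True,
                            capture_output=True, check=True)
    print("Submitted job:", result.stdout.strip())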

The efficient design, operation, and maintenance of NCAR’s data-centric HPC environment supports the scientific productivity of its user community. CISL’s continued leadership in providing discipline-focused computing and data services is a critical role for NCAR as a national center. The GLADE file system supports the efficient management, manipulation, and analysis of data, which is crucial for progress on the grand challenges of Earth system numerical modeling, and every user of CISL’s HPC environment benefits from its data-centric configuration. The addition of a Campaign Storage tier to the GLADE storage system provided archival resources for efficient access to data generated and used as part of computational campaigns. It also freed high-performance storage capacity that had been used for long-term storage and eliminated the need to use deep-archive tape resources for short-term storage. The Campaign Storage tier is thus a key part of CISL’s overall storage and data management strategy going forward.

Computing and analysis services

In FY2019, CISL continued operating the Cheyenne supercomputer, a Hewlett Packard Enterprise (HPE) SGI 8600 ICE-XA system with dual-socket, 18-core Intel Xeon Broadwell processors on 4,032 nodes (145,152 total cores), 315 terabytes (TB) of memory, and peak processing power of 5.34 petaflops (5.34 quadrillion floating point operations per second). Cheyenne, with a service lifetime spanning calendar years 2017 through 2021, provided significant value throughout FY2019, delivering more than 972 million CPU hours and executing more than 6.28 million jobs.

NCAR’s new data analysis, visualization, and machine learning system, Casper, was released to users on October 3, 2018. Casper hardware consists of 28 Supermicro nodes featuring Intel Xeon Gold (Skylake) processors. Six of the nodes – including two added to the cluster in February 2019 – are provisioned with dense NVIDIA Tesla V100 GPU configurations and large memory to support explorations in machine learning and deep learning in the atmospheric and related sciences. Casper delivered 2.3 million CPU hours to 1.3 million jobs in FY2019.

The addition of Casper allowed CISL to decommission two aged analysis and visualization clusters called Geyser and Caldera, which were acquired in FY2013 with Cheyenne’s predecessor, Yellowstone.

The following table and figures provide reliability, availability, and utilization metrics for Cheyenne and Casper during FY2019.

FY2019 Reliability Metrics

  Metric                                            Cheyenne    Casper
  System Availability                               98.82%      97.78%
  Downtime: Hardware & Software                     1.18%       2.22%
  Downtime: Scheduled Maintenance & Environmental   4.49%       2.29%
  User Availability                                 94.33%      95.49%
  Average User Utilization                          80.32%      14.21%

 

[Cheyenne availability graph]
Figure 1. NCAR’s Cheyenne supercomputer was well utilized during the fiscal year.

 

[Casper availability graph]
Figure 2. The Casper data analysis, visualization, and machine learning resource experienced rapid growth in utilization during FY2019.

 

Central storage environment and services

The GLADE system procured with Yellowstone in FY2012 was decommissioned at the end of calendar year 2018. Referred to as GLADE-1, it had provided centralized storage to support computational, data analysis, and visualization tasks for six years. Its decommissioning followed the FY2017 deployment of GLADE-2, which had an initial usable capacity of 20 PB that was expanded later in the year to 38 PB.

In FY2019, new versions of system software were installed on both the GLADE-2 and Campaign Storage systems. The updates improved performance, added features for data protection and integrity, and made the systems more robust by reducing the time required to bring them into service and by optimizing server/client communications.

Finally, in FY2019 CISL augmented the storage capacity of the Campaign Storage tier by adding 5 PB of usable disk space.

Evaluating future technologies

In FY2019, CISL began evaluating Globus as a common data-transfer interface across all archival tiers of the HPC storage architecture. The evaluation was completed and Globus has been deployed to support the GLADE, Campaign Storage, and HPSS systems as well as the use of Amazon Web Services Simple Storage Service (AWS S3) buckets. 
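
As an illustration of what this common interface looks like from a user’s script, the sketch below stages data from one storage system to another with the Globus Python SDK (globus_sdk). The access token, endpoint IDs, and paths are hypothetical placeholders; the actual collection names for GLADE, Campaign Storage, and HPSS would come from CISL’s documentation.

    import globus_sdk

    # Placeholders: a Globus transfer token and the endpoint/collection UUIDs for
    # the source and destination storage systems are assumed to be known.
    TRANSFER_TOKEN = "..."
    SOURCE_ENDPOINT = "..."   # e.g., a GLADE collection
    DEST_ENDPOINT = "..."     # e.g., a Campaign Storage collection

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

    # Describe an asynchronous, checksummed transfer between the two systems.
    tdata = globus_sdk.TransferData(tc, SOURCE_ENDPOINT, DEST_ENDPOINT,
                                    label="stage campaign outputs",
                                    sync_level="checksum")
    tdata.add_item("/path/to/experiment_output/",   # hypothetical source directory
                   "/path/to/project_area/",        # hypothetical destination
                   recursive=True)

    task = tc.submit_transfer(tdata)
    print("Globus task id:", task["task_id"])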

CISL also began evaluating technology developments that respond to the limited bandwidth and rising power costs of dynamic random access memory and to the high latencies and low I/O operations per second (IOPS) of hard disk drives. For example, HPC compute nodes built with two multi-core processors are being transitioned to processors enabled for high-bandwidth memory (HBM), and non-volatile, very-low-latency, solid-state storage devices are being inserted as new tiers between disk and memory. New heterogeneous architectures (for example, Intel Xeon and Marvell ThunderX) are also emerging; these combine components customized for particular functions with many-core processors containing hundreds of processing elements, fed by stacked HBM and by the solid-state storage tiers noted above.
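
Memory bandwidth is the figure of merit behind several of these trends. As a rough, illustrative sketch (not one of CISL’s benchmarks), the snippet below estimates sustained copy bandwidth on a node; because it is single-threaded NumPy, it will understate what a tuned, multi-threaded STREAM run on the same hardware would report.

    import time
    import numpy as np

    def copy_bandwidth_gbs(n_elems=100_000_000, repeats=5):
        """Estimate sustained 'copy' bandwidth (GB/s), STREAM-style: b[:] = a."""
        a = np.ones(n_elems, dtype=np.float64)   # ~800 MB source array
        b = np.empty_like(a)                     # ~800 MB destination array
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            np.copyto(b, a)                      # one read + one write per element
            best = min(best, time.perf_counter() - start)
        bytes_moved = 2 * n_elems * a.itemsize
        return bytes_moved / best / 1e9

    if __name__ == "__main__":
        print(f"Approximate copy bandwidth: {copy_bandwidth_gbs():.1f} GB/s")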

In FY2019, CISL deployed a small, four-node test cluster in the High-Performance Computing Futures Lab (HPCFL) to mitigate the risks associated with the new Arm-based Marvell ThunderX2 architecture. The cluster provides the resources NCAR needs to port, test, and benchmark Earth system science applications – such as the CESM, MPAS, WRF, DART, and MURaM models – and their associated software stacks, including the necessary libraries, tools, and system software, on Arm processors.

As with the ThunderX2 system, the HPCFL has allowed CISL to evaluate the function and manageability of a wide variety of technologies. Recent FY2019 projects include scheduler integration with the AWS cloud platform and the evaluation of two different 100-gigabit Ethernet fabrics, which may allow CISL to consolidate and simplify its HPC network in the future. With the on-site HPCFL, CISL has the unique opportunity to directly evaluate how difficult a potential future system would be to manage. CISL recently deployed a system based on the NEC Aurora accelerator card, for example, and discovered that while it is possible to get optimized benchmark results, the air-cooled accelerator card experiences overheating in real-world conditions. In addition, the HPCFL has served as a testbed for a wide variety of software technologies, notably the Spack package manager and the development of new features for the Slurm scheduler.
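
To give a flavor of the Spack work, the sketch below drives Spack from Python to build a library on a test node and locate the resulting installation. The package and compiler spec shown (netcdf-fortran built with a particular GCC version) are illustrative choices, not a record of the HPCFL’s actual software stack.

    import subprocess

    def spack(*args):
        """Run a Spack command on the test cluster and return its stdout."""
        result = subprocess.run(["spack", *args], check=True,
                                capture_output=True, text=True)
        return result.stdout.strip()

    # Build an I/O library against a specific compiler (Spack's spec syntax).
    spack("install", "netcdf-fortran", "%gcc@9.1.0")

    # Ask Spack where it placed the installation so build systems can link against it.
    prefix = spack("location", "-i", "netcdf-fortran", "%gcc@9.1.0")
    print("netcdf-fortran installed under:", prefix)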

Procurement activities and roadmap

CISL formally launched the procurement project for the follow-on to Cheyenne in FY2019. The forthcoming system will be the third housed at the NCAR-Wyoming Supercomputing Center (NWSC) and is identified as “NWSC-3” for procurement purposes. CISL began the multi-year procurement process in June 2018 by assembling a team of management and technical staff to evaluate the overarching requirements for the HPC and data-storage resources that will replace those acquired as part of the Cheyenne procurement. A technical team began gathering requirements for the new system in September 2018, reviewing lessons learned from the Cheyenne procurement, assessing the existing workload and future user requirements, obtaining non-disclosure briefings from multiple vendors, and discussing the anticipated system requirements with prospective vendors. Select members of the CISL team also visited other large HPC installations to gather information for developing the next system.

In addition, science requirements were developed based on input from 44 computational scientists from across NCAR and the university community. These scientists were invited to represent the range of science domains and of common and anticipated use cases, including GPU-accelerated codes and machine learning. A subset of the group was then selected based on its members’ technical expertise with HPC systems outside of NCAR. CISL convened meetings at which team members were briefed on the procurement process, asked to describe their use of NCAR HPC systems, and helped craft a survey of the broader user community. This external, science-focused team finalized a set of recommendations for the NWSC-3 technical specifications based on the Cheyenne Workload and Usage Analysis Study, user survey responses, and CISL’s analysis of their use-case input.

These efforts culminated in a refined set of potential technical requirements for the HPC and data-storage resources. To gather additional technical and budgetary information, CISL worked with the UCAR Contracts Office to issue a Request for Information (RFI) in mid-February 2019. The RFI solicited technical and budgetary information for:

  1. a next-generation HPC system, 

  2. an experimental research and development computational system comprising emerging many-core, GPU/accelerator, and/or alternative processor technologies, and

  3. a high-performance storage file system.

The RFI invited responses addressing one, two, or all three of the above resources, depending on the supplier’s area of expertise. Five major suppliers responded in late March 2019. Although the RFI specified a $30M budget, the responding vendors all concluded that it was not adequate to meet the stated performance requirements. The working teams assessed the RFI responses and made four key findings:

  1. While the solutions were interesting, they were disappointing from a technical perspective, offering little in the way of technical innovation.

  2. The solutions provided significantly less computational capacity and capability than anticipated.

  3. The solutions were costlier than anticipated.

  4. The high-IOPS flash storage tier was much costlier than anticipated.

The RFI was followed by separate, full-day co-design/co-architecture meetings with the four vendors that submitted complete responses. The vendors were asked to discuss the rationale and design point of their proposed configurations, and the co-design discussions resulted in guidance to the vendors for subsequent configuration revisions. CISL also released its High-Performance Computing Benchmarks in July 2019 to give prospective vendor and supplier benchmarking teams the opportunity to develop their performance projections. CISL anticipates releasing a formal Request for Proposal (RFP) in FY2020, with an award slated for early FY2021. Production availability for the NWSC-3 system is planned for FY2022.

Finally, CISL began the initial project scoping, schedule coordination, and planning for the NWSC facility infrastructure upgrades required to support the new HPC system. While supercomputer-specific facility enhancements must wait until after vendor selection, other facility fit-up work that is not specific to a particular vendor’s solution can begin before the award is made.

Campaign Storage

CISL’s storage strategy encompasses all of its storage resources, from high-bandwidth, high-IOPS, flash-based storage to tape-based deep-archival systems. To manage data workflows and resources more effectively, in FY2019 CISL designed an end-to-end storage strategy that more closely matches the scientific workflow while optimizing investments in storage. The Campaign Storage tier, for example, is suited to workflows that require retaining data for the duration of a project, typically five years. The data may be actively used during that period, but accessing them does not require the same performance characteristics as computational workflows, nor are the data ready for permanent archival.

Cloud-based object storage solution

HPC users are facing a data problem, particularly in the areas of data volume (that is, huge numerical model outputs) and variety (large numbers of disparate data sets from multiple sources with inconsistent data-management practices). In order to provide HPC users the ability to compute and analyze large data collections while shielding them from the underlying complexities of file formats, storage location, parallel data analysis, and other details, CISL is exploring newer I/O technologies to develop an end-to-end storage strategy better suited to current computational and data analysis workflows. 
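
As part of this exploration, object-store access patterns look quite different from POSIX file I/O. The minimal sketch below moves a model output file into and out of an S3 bucket with the standard boto3 client; the bucket name, object key, and file names are hypothetical and do not refer to an actual NCAR service.

    import boto3

    # Hypothetical bucket and object key; credentials are assumed to be configured
    # in the usual AWS ways (environment variables, ~/.aws/credentials, etc.).
    BUCKET = "example-earth-science-bucket"
    KEY = "experiments/run01/output.nc"

    s3 = boto3.client("s3")

    # Push a finished model output file into object storage.
    s3.upload_file("output.nc", BUCKET, KEY)

    # Later, pull it back down next to an analysis workflow.
    s3.download_file(BUCKET, KEY, "output_for_analysis.nc")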

Tape archive solution

CISL’s tape archive solution, HPSS, provides long-term, durable storage for the scientific data holdings entrusted to NCAR. Throughout 2019, CISL worked with the scientific community to transition their workflows so that active, archival-class data are written to Campaign Storage and only cold data are written to HPSS. This effort reduced the archive’s growth rate from 558 TB per month to 153 TB per month – a major step toward a truly cold archive – and saved NCAR money by avoiding the need to purchase new tape media.
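
In practice, writing cold data to HPSS typically goes through the HSI client. The sketch below wraps two HSI commands from Python to archive a finished, bundled data set and verify the copy; the local and HPSS paths are placeholders, and whether a given data set belongs on Campaign Storage or HPSS follows the workflow guidance described above.

    import subprocess

    def hsi(command):
        """Run one HSI command against the HPSS archive (requires the hsi client)."""
        subprocess.run(["hsi", command], check=True)

    # Cold data: archive a finished, bundled experiment to the deep archive.
    # HSI's copy syntax is 'put <local_path> : <hpss_path>'; paths here are placeholders.
    hsi("put experiment01_final.tar : /home/myuser/archive/experiment01_final.tar")

    # Confirm that the archived copy is present.
    hsi("ls -l /home/myuser/archive/experiment01_final.tar")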

The HPSS tape archive is nearing its end of life, and CISL has been developing the architectural design and specifications for a new, cold tape archive solution. This effort is another component of the storage architecture redesign that has been under way in CISL for the past couple of years. The new archive is being designed to house curated data holdings with a “write once, read almost never” access pattern, and it will be the final piece of CISL’s redesigned storage architecture comprising GLADE, Campaign Storage, cloud-based object storage, and the deep archive. The new tape archive solution is scheduled to be in production in mid-2020.