Provide supercomputing resources

Central to CISL’s Strategic Plan, the NWSC high-performance computing (HPC) environment includes petascale supercomputing resources, NCAR’s Globally Accessible Data Environment (GLADE) centralized file system, the NCAR High Performance Storage System (HPSS) data archive, and Data Analysis and Visualization (DAV) resources. A key focus for CISL’s Operations and Services Division is to operate these systems at the highest levels of availability and performance, thereby enabling the science, research, and discovery of the atmospheric and related sciences communities served by these resources.

During FY2017, CISL’s computational environment was augmented with the Cheyenne supercomputer, a corresponding enhancement to GLADE’s capacity and capability, and additional data archival resources. Cheyenne is a robust, well-balanced, and forward-looking computing system with an expected operational life of at least five years, and it is capable of delivering at least three times more sustained performance than its predecessor, Yellowstone. The Yellowstone HPC system, along with its Caldera and Geyser DAV systems, has been operated for nearly five years. During FY2017, these HPC systems provided a total of more than 1,013 million CPU core hours of production computing for more than 15.9 million user jobs.

Cheyenne supercomputer
The 5.3-petaflops Cheyenne supercomputer entered user production in January 2017.

CISL’s high-performance computing and storage environment efforts are substantially focused on providing robust, reliable, and secure high-performance computing resources in a production research environment, and on supporting that environment for the thousands of users and hundreds of projects spanning NCAR, universities, and the broader atmospheric sciences community. CISL continuously monitors system workload, utilization, and performance, and balances resource allocation through priority-based, intelligent job scheduling, a well-tuned job queue structure, and single-job resource limits, all of which enable high, efficient resource utilization and rapid job turnaround.
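The scheduling approach described above can be illustrated with a minimal sketch. The queue names, priorities, and per-job core cap below are hypothetical placeholders, not CISL’s actual scheduler configuration:

```python
import heapq

# Minimal sketch of priority-based job scheduling with a single-job resource
# limit. Queue names, priority values, and the core cap are illustrative only.
QUEUE_PRIORITY = {"premium": 0, "regular": 1, "economy": 2}  # lower runs sooner
MAX_CORES_PER_JOB = 4608  # hypothetical single-job core limit

def schedule(jobs):
    """Order jobs by queue priority, then submission order; drop oversized jobs."""
    heap = []
    for seq, (name, queue, cores) in enumerate(jobs):
        if cores > MAX_CORES_PER_JOB:
            continue  # enforce the single-job resource limit
        heapq.heappush(heap, (QUEUE_PRIORITY[queue], seq, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

jobs = [("climo", "regular", 1024),
        ("huge", "regular", 100000),   # exceeds the per-job cap, rejected
        ("urgent", "premium", 512)]
print(schedule(jobs))  # ['urgent', 'climo']
```

A production scheduler also weighs fair-share history, wait time, and node availability; this sketch captures only the priority-plus-limits idea the paragraph describes.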

CISL’s commitment to a data-intensive computing strategy extends beyond the computational and data storage environment and includes a full suite of science gateway and data portal services. CISL continues to lead the community in developing data services that can address the future challenges of data growth, management, curation, and preservation. CISL also leads in supporting NSF’s requirement for active data and metadata management.

The efficient design, operation, and maintenance of NCAR’s data-centric HPC environment supports the scientific productivity of its user community. CISL’s continued leadership in providing discipline-focused computing and data services is a critical role for NCAR as a national center. The GLADE centralized file system supports the efficient management, manipulation, and analysis of data, which is crucial for progress on the grand challenges of Earth System numerical modeling, and every user of CISL’s HPC environment benefits from its data-centric configuration.

NWSC-2 procurement resources

Following an 18-month NWSC-2 procurement effort, the 5.3-petaflops Cheyenne supercomputer, a 38-petabyte expansion of GLADE, and a 280-gigabyte/second connection between the two new resources were delivered to and installed within the NWSC in late FY2016. During the first three months of FY2017, Cheyenne and the new storage were successfully tested, accepted, and placed into scientific production on schedule in January 2017. In the subsequent months, Cheyenne and the new GLADE resources have proven to be stable, reliable, highly productive, and much more energy efficient than their predecessors, thus enhancing NCAR’s CI environment.

Cheyenne successfully ran a production workload for nine months during FY2017, and as previously planned, CISL will decommission the Yellowstone system at the end of calendar year 2017. CISL will continue to operate the Caldera and Geyser DAV systems until they are replaced by a next-generation DAV resource in mid-2018. CISL’s procurement process for the next-generation DAV system is underway.

High Performance Computing

This year, CISL began production operation of the Cheyenne supercomputer, an SGI ICE-XA system with 4,032 nodes, 145,152 Intel Xeon E5-2697v4 (Broadwell) processing cores, 315 terabytes of memory, and a peak processing power of 5.34 petaflops (5.34 quadrillion floating-point operations per second). Similar to Yellowstone, Cheyenne’s design and configuration provide exceptional and balanced I/O and computational capacity for the data-intensive needs of NCAR’s user community. Cheyenne debuted at 20th on the 48th edition (November 2016) of the industry’s TOP500 list and 18th on the High Performance Conjugate Gradients (HPCG) Benchmark. In the HPC world, newer, more powerful systems continually displace older systems on the TOP500 list; as of the most recent list (June 2017), Cheyenne ranked 22nd. During FY2017, Cheyenne provided more than 536 million CPU core hours for users by executing more than 1.03 million jobs.

In FY2017, CISL continued operating the Yellowstone supercomputer, an IBM iDataPlex cluster with 4,536 nodes, 72,576 Intel Xeon E5-2670 (Sandy Bridge) processing cores, 145 terabytes of memory, and a peak processing power of 1.5 petaflops (1.5 quadrillion floating-point operations per second). When initially installed in 2012, Yellowstone was ranked as the 13th most powerful supercomputer in the world. By June 2014, Yellowstone was the 29th most powerful system on the TOP500 list; by June 2015, it had dropped to 49th place; in June 2016, it ranked 68th; and on the June 2017 list it ranked 92nd. While targeted for decommissioning at the end of 2017, Yellowstone continued to provide significant value through the fiscal year, delivering over 467 million CPU core hours while executing over 4.88 million jobs.
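The peak figures quoted for both machines can be reproduced from the core counts above. The clock rates and per-core floating-point widths used below are assumed nominal processor values, not stated in this report: 2.3 GHz and 16 double-precision flops/cycle for Broadwell (AVX2 with FMA), and 2.6 GHz and 8 flops/cycle for Sandy Bridge (AVX):

```python
# Back-of-envelope check of the quoted peak-performance figures.
# Clock rates and flops/cycle are assumed nominal values for these CPUs.
def peak_pflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in petaflops: cores x cycles/sec x flops/cycle."""
    return cores * ghz * 1e9 * flops_per_cycle / 1e15

cheyenne_pf = peak_pflops(145_152, 2.3, 16)   # Broadwell: AVX2 FMA
yellowstone_pf = peak_pflops(72_576, 2.6, 8)  # Sandy Bridge: AVX
print(f"Cheyenne: {cheyenne_pf:.2f} PF, Yellowstone: {yellowstone_pf:.2f} PF")
# Cheyenne: 5.34 PF, Yellowstone: 1.51 PF
```

Under these assumptions the arithmetic matches the 5.34-petaflop and 1.5-petaflop figures in the text.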

Yellowstone supercomputer
The IBM iDataPlex supercomputer, Yellowstone, has been the primary computational platform for nearly five years. It has provided a highly available, highly utilized HPC environment in support of NCAR and university science.

GLADE data-centric environment

NCAR’s HPC environment would be useless without corresponding high-performance data resources. NCAR’s Globally Accessible Data Environment (GLADE) provides NCAR’s centralized high-performance file systems and forms the hub of its data-centric HPC environment. This centralized design, independent of the HPC resources, improves scientific productivity and reduces costs by eliminating the expense (time, energy, and latency) of moving data between systems and/or maintaining and synchronizing multiple copies of data. Temporary “scratch” spaces and longer-term “work” spaces are available to all users of the supercomputer systems, and long-term “project” space is allocated through the various allocation panels. GLADE also plays a continuing and growing role as host to curated data collections from CISL’s Research Data Archive (RDA), NCAR’s Community Data Portal, the EOL Metadata Database and Cyberinfrastructure, and NCAR’s Earth System Grid portal. GLADE utilizes high-speed InfiniBand connectivity to the HPC systems as well as a 40 gigabit/second Ethernet data-transport backbone.

GLADE’s initial implementation (aka GLADE-1), provisioned coincident with the installation of the Yellowstone supercomputer, provided a shared, high-performance (90 GB/second), high-capacity (16.4 PB) central file system serving all the computing and support systems. Its NWSC-2 enhancement in FY2017 (aka GLADE-2) expanded the total capacity to more than 38 petabytes and provided better than 280 GB/second aggregate bandwidth connectivity to the Cheyenne supercomputer. For more information on GLADE, see the centralized high-speed data storage section of this report.

Data archival services

Just as NCAR’s HPC systems need GLADE to provide and store data, the HPC systems and GLADE require a back-end archive: a long-term repository for observationally or computationally derived data that is historically important or cannot be reproduced. The NCAR High Performance Storage System (HPSS) is an advanced, highly scalable, and flexible mass storage system and archival resource that supports both NCAR’s supercomputing environment and divisional servers run by other NCAR laboratories and UCAR programs. HPSS became NCAR’s only data archival facility after March 29, 2011, when NCAR’s locally developed, 25-year-old Mass Storage System was decommissioned. In the years since, CISL has annually augmented the capacity and capability of the archive’s infrastructure and enhanced the supporting HPSS software. By the end of FY2017, the total holdings of the NCAR HPSS archive had grown to over 81 petabytes, with unique holdings of over 79 petabytes. For more information, see the NCAR HPSS archive section of this report.

Antarctic Mesoscale Prediction System support

A critical part of NCAR’s HPC workload, the Antarctic Mesoscale Prediction System (AMPS), was migrated to Cheyenne during FY2017. Twice daily, AMPS produces numerical weather predictions over the Antarctic continent in support of U.S. Antarctic Program flight operations and its polar observatory, as well as research and education activities involving Antarctic meteorology. The AMPS operational forecasts, which use the Weather Research and Forecasting (WRF) model, were migrated from the aging Erebus system to Cheyenne in June.

CISL operated Erebus, an 84-node supercomputing cluster dedicated to AMPS, from July 2012 through early July 2017. During FY2017, Erebus provided 6.6 million CPU core hours of computing for over 3.1 million AMPS jobs. Following the migration, the Erebus system was decommissioned and repurposed as part of CISL’s High Performance Futures Laboratory.

Data Analysis and Visualization

The NWSC’s HPC environment would be incomplete without dedicated Data Analysis and Visualization (DAV) systems for analysis and reduction of computer-generated data, forecast model intercomparisons, and observational data analysis. NWSC’s DAV resources comprise the Caldera and Geyser systems, both specifically configured for tasks that utilize NVIDIA graphics processing units (GPUs). The 16-node Geyser cluster – with 1 terabyte of memory and one NVIDIA K5000 GPU per node – was designed for data synthesis, analysis, and visualization tasks. The 30-node Caldera cluster – with nodes identical to Yellowstone’s, but with twice as much memory – was designed for computationally intensive, GPGPU-accelerated parallel applications and data analysis tasks. Sixteen of Caldera’s nodes each contain two NVIDIA K20X GPGPUs.

During FY2017, CISL began the NWSC-2a procurement process for a next-generation DAV system which is anticipated to replace Caldera and Geyser in mid-2018. For more information on CISL’s DAV support, see the data analysis and visualization section of this report.

Data services

The NCAR Data Sharing Service continued to provide researchers a way to share large data sets with collaborators around the world using a simple web-based interface. Based on the Globus Plus software (a tool that emerged from a partnership with the University of Chicago and Argonne National Laboratory), the NCAR Data Sharing Service provides access to GLADE file systems, data transport servers, and high-speed network connectivity to external research networks.

Our disk- and tape-based HPSS archival storage systems provide an efficient, safe, and reliable environment for long-term offline hosting of datasets, and they offer user-friendly interfaces for quickly retrieving stored data. CISL also leads in supporting NSF’s requirement for data management plans, and it has streamlined and improved its data services through the data-centric design of the NWSC environment, particularly via the GLADE file systems and storage infrastructure.

HPC Futures Laboratory

CISL continued operating and enhancing its HPC Futures Lab (HPCFL), a hardware and software testbed focused on HPC research and emerging HPC technologies and a crucial element in helping CISL assess technologies destined for future production systems and improve NCAR’s HPC environment. The HPCFL provides system administrators, consultants, and scientists with a ready-to-use environment where cutting-edge technology can be deployed and tested. CISL’s current efforts examine hardware areas such as heterogeneous architectures, GPGPUs, coprocessors, and hierarchical memory and storage, as well as software areas such as resource managers, job schedulers, Message Passing Interface (MPI) software, benchmarks, performance tuning, file systems, and a variety of computation-intensive applications. For more information, see the High Performance Futures Laboratory section of this report.

Production system specifications

This hardware configuration table summarizes the technical details of the production supercomputing systems that were operated and maintained by CISL during FY2017.

NWSC systems configuration
NWSC high-performance computing resources used during FY2017 and their key attributes.

System reliability and usage

During FY2017, Cheyenne and Yellowstone were both highly available and well utilized. Cheyenne’s average scheduled availability was 99.0% and user utilization was 61.9%. Yellowstone’s average scheduled availability was 99.7% and user utilization was 90.2%. The DAV systems Caldera and Geyser averaged 96.6% scheduled availability and 30.0% user utilization. This table provides key reliability and usage metrics for the NWSC HPC resources during FY2017.
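As a rough guide to how the metrics above relate, user utilization can be expressed as delivered core-hours divided by the core-hours the system could have delivered while available. The sketch below uses hypothetical round numbers, not the actual accounting behind the table:

```python
# Illustrative utilization calculation: delivered core-hours over deliverable
# core-hours. Inputs are hypothetical round numbers, not CISL accounting data.
def utilization(core_hours_delivered, total_cores, wallclock_hours, availability):
    deliverable = total_cores * wallclock_hours * availability
    return core_hours_delivered / deliverable

# e.g., 500 million core-hours delivered by a 145,152-core system over
# 6,000 wallclock hours at 99% scheduled availability
print(f"{utilization(500e6, 145_152, 6_000, 0.99):.1%}")  # 58.0%
```

Real accounting must also handle reserved nodes, partial-month operation, and charging policies, so published figures will not come from a formula this simple.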

NWSC performance
Average system availability and utilization during FY2017, along with some key usage and job metrics.

The figures below show the availability and utilization profiles for Cheyenne and Yellowstone during FY2017, with annotations for days when availability was less than 90%. Several of the incidents causing user downtime during FY2017 were either preparatory work by CISL for the new Cheyenne and GLADE-2 resources (i.e., before January) or software, networking, and facility upgrades in support of the new equipment and the enhancement of the NWSC’s HPC and storage infrastructure.

Cheyenne performance
FY2017 availability and utilization profiles for the Cheyenne HPC system. Downtimes that resulted in a daily availability of less than 90%, primarily due to power outages and scheduled maintenance activities, are annotated.


Yellowstone and DAV performance
FY2017 availability and utilization profiles for the HPC (Yellowstone) and DAV (Caldera and Geyser) resources. Downtimes that resulted in a daily availability of less than 90%, primarily due to power outages and scheduled maintenance activities, are annotated.

Funding source

The NWSC environment, including HPC, GLADE, and DAV resources, is made possible through NSF Core funds, with supplemental support from the University of Wyoming. AMPS computing was supported by NSF Special funding.