Provide supercomputing resources

The NWSC high-performance computing (HPC) environment includes petascale supercomputing resources, the Globally Accessible Data Environment’s (GLADE) centralized high-performance file systems, the NCAR HPSS data archive, and data analysis and visualization resources. CISL’s Operations and Services Division operates these systems at the highest levels of availability and performance to enable the science, research, and discovery of the atmospheric and related sciences communities served by these resources. Initially placed into production in late 2012, the Yellowstone HPC system, along with its Geyser and Caldera data analysis and visualization systems, has been operated for nearly four years. During FY2016, these systems supported 528 million core-hours of computing across more than 12.8 million jobs.

In the upcoming year, CISL’s computational environment will be augmented with the new Cheyenne supercomputer, which will provide more than three times the computational capacity of Yellowstone, plus additional data storage and archival resources. CISL will also begin the procurement process for an advanced data analysis and visualization platform to replace the Geyser and Caldera systems.

Cheyenne supercomputer at NWSC
The 5.3 petaflops Cheyenne supercomputer, after installation at the NCAR-Wyoming Supercomputing Center (NWSC), is undergoing final check-out and testing and will begin user production in January 2017.


CISL’s high-performance computing and storage efforts focus on providing robust, reliable, and secure high-performance computing resources in a production research environment, and on supporting that environment for the thousands of users and hundreds of projects spanning NCAR, universities, and the broader atmospheric science community. CISL continuously monitors system usage and performance, and it balances resource allocation through priority-based job scheduling, a well-tuned job queue structure, and single-job resource limits. CISL’s commitment to a data-intensive computing strategy extends beyond the HPC/GLADE environment and includes a full suite of science gateway and data portal services. CISL continues to lead the community in developing data services that can address the future challenges of data growth, preservation, curation, and management. CISL also leads in supporting NSF’s requirement for data management plans.
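The combination of queue priorities and single-job limits described above can be illustrated with a minimal sketch. The queue names, priority weights, and per-job core limits below are hypothetical examples only and do not represent CISL's actual scheduler configuration.

    import heapq

    # Hypothetical queue definitions: a priority weight and a per-job core limit.
    QUEUES = {
        "premium": {"priority": 3, "max_cores": 16384},
        "regular": {"priority": 2, "max_cores": 8192},
        "economy": {"priority": 1, "max_cores": 4096},
    }

    def admit(job):
        """Reject jobs that exceed the single-job resource limit for their queue."""
        return job["cores"] <= QUEUES[job["queue"]]["max_cores"]

    def schedule(jobs, free_cores):
        """Dispatch admitted jobs in priority order while cores remain."""
        heap = []
        for seq, job in enumerate(jobs):
            if admit(job):
                # Negate priority so the highest-priority job pops first.
                heapq.heappush(heap, (-QUEUES[job["queue"]]["priority"], seq, job))
        dispatched = []
        while heap and free_cores > 0:
            _, _, job = heapq.heappop(heap)
            if job["cores"] <= free_cores:
                free_cores -= job["cores"]
                dispatched.append(job["name"])
        return dispatched

    jobs = [
        {"name": "cesm_run", "queue": "regular", "cores": 4096},
        {"name": "wrf_test", "queue": "premium", "cores": 512},
        {"name": "post_proc", "queue": "economy", "cores": 8192},  # exceeds limit, rejected
    ]
    print(schedule(jobs, free_cores=8192))  # ['wrf_test', 'cesm_run']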

The efficient design, operation, and maintenance of NCAR’s data-centric HPC environment support scientific productivity at research universities, NCAR, and other organizations, and CISL’s ongoing leadership in providing discipline-focused computing services is a critical role for NCAR as a national center. CISL also operates a supercomputer that provides real-time numerical weather predictions for the United States Antarctic Program, Antarctic science, and international Antarctic efforts.

The GLADE centralized file system supports the efficient management of data, which is critical for progress on the grand challenges of numerically modeling the Earth System, and every user of CISL’s HPC environment benefits from its data-centric configuration. Users can now arrange their workflows to use stored data directly without first needing to move or copy it.
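A minimal sketch of such a workflow step is shown below: the dataset is read in place from the shared file system, with no staging or copying between systems. The file path and variable name are hypothetical examples, and the sketch assumes the netCDF4 Python library is available.

    # Read model output directly from centralized storage and summarize it.
    from netCDF4 import Dataset

    path = "/glade/p/example_project/output/case01.nc"  # hypothetical project-space file

    with Dataset(path, mode="r") as nc:
        tas = nc.variables["tas"][:]        # hypothetical variable name
        print("Global mean:", tas.mean())   # computed in place, no data movement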

NWSC-2 procurement

On behalf of CISL, UCAR issued the NWSC-2 Request for Proposals in FY2015 for the acquisition of a new HPC resource to replace Yellowstone and for an augmentation of the GLADE environment. Awards were made in December 2015 for a new supercomputer, named Cheyenne, and for additional GLADE file systems. By the end of FY2016, the 5.3 petaflops Cheyenne supercomputer had been delivered, installed, and powered up at the NWSC and had begun final check-out and testing. Similarly, 20 petabytes of additional storage, with an aggregate bandwidth of 220 gigabytes per second, has been installed at the NWSC and is being integrated into the GLADE environment. These additional resources will become available for production use by January 2017.

Data-centric environment

CISL’s Globally Accessible Data Environment (GLADE) provides centralized high-performance file systems and forms the hub of CISL’s data-centric HPC environment. GLADE provides a shared, high-performance (90 GB/second), high-capacity (16.4 PB) central file system connecting all the computing and support systems required for scientific computation and associated workflows. This centralized design, independent of the HPC resources, improves scientific productivity and reduces costs by eliminating the expense (in time and energy) of moving data between systems and/or maintaining multiple copies of data. Temporary “scratch” spaces and longer-term “work” spaces are available to all users of the supercomputer systems, and long-term project space is allocated through the various allocation panels. GLADE also plays a growing role in hosting curated data collections from CISL’s Research Data Archive (RDA), NCAR’s Community Data Portal, the EOL Metadata Database and Cyberinfrastructure, and NCAR’s Climate Simulation Data Gateway. During FY2016, CISL enhanced GLADE’s network connectivity with a 40-gigabit-per-second Ethernet data-transport backbone. By the end of 2016, CISL will expand GLADE to a total data capacity in excess of 36 petabytes and provide more than 220 gigabytes per second of aggregate bandwidth to the Cheyenne supercomputer.

Yellowstone supercomputer at NWSC
The IBM iDataPlex supercomputer named Yellowstone has been the primary computational platform for nearly four years – providing a highly available, highly utilized HPC environment in support of NCAR and university science.

High Performance Computing (HPC)

CISL continued operating the Yellowstone supercomputer, an IBM iDataPlex cluster with 4,536 nodes, 72,576 Intel Xeon E5-2670 (Sandy Bridge) processing cores, 145 terabytes of memory, and a peak processing power of 1.5 petaflops (1.5 quadrillion floating-point operations per second). Yellowstone’s design and configuration target the data-intensive computing needs of the Earth System sciences, disciplines that push the limits of computational and data systems. When initially installed in 2012, Yellowstone was ranked as the 13th most powerful supercomputer in the world. In the HPC world, newer, more powerful systems continually displace older systems on the TOP500 list: by June 2014 Yellowstone was the 29th most powerful system on that list, it dropped to 49th place in June 2015, and by June 2016 it ranked 68th.

The new Cheyenne supercomputer is an SGI ICE-XA cluster with 4,032 nodes, 145,152 Intel Xeon E5-2697v4 (Broadwell) processing cores, 315 terabytes of memory, and a peak processing power of 5.34 petaflops. Similar to Yellowstone, Cheyenne’s design and configuration will provide balanced I/O and exceptional computational capacity for the data-intensive needs of its user community. Cheyenne is expected to rank in the top 20 most powerful systems in the world when it is commissioned into production service in January 2017.
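The quoted peak figures follow from the usual cores-times-clock-times-FLOPs-per-cycle estimate. The sketch below checks them; the clock rates and per-cycle throughputs are the processors' published specifications, not figures taken from this report.

    # Back-of-the-envelope check of the quoted peak performance figures.
    def peak_pflops(cores, ghz, flops_per_cycle):
        return cores * ghz * 1e9 * flops_per_cycle / 1e15

    # Yellowstone: E5-2670 (Sandy Bridge), 2.6 GHz, 8 double-precision FLOPs/cycle (AVX)
    print(round(peak_pflops(72_576, 2.6, 8), 2))    # ~1.51 petaflops

    # Cheyenne: E5-2697v4 (Broadwell), 2.3 GHz, 16 double-precision FLOPs/cycle (AVX2 FMA)
    print(round(peak_pflops(145_152, 2.3, 16), 2))  # ~5.34 petaflops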

Data Analysis and Visualization

NWSC’s HPC environment is completed by dedicated Data Analysis and Visualization (DAV) systems for analyzing computer-generated data, performing forecast-model intercomparisons, and analyzing observational data. NWSC’s DAV resources comprise the Geyser, Caldera, and Pronghorn systems. Geyser and Caldera are specifically configured for tasks that use NVIDIA graphics processing units (GPUs). The 16-node Geyser cluster, with 1 terabyte of memory and a single NVIDIA K5000 GPU per node, was designed for data synthesis, analysis, and visualization tasks. The 16-node Caldera cluster, with two NVIDIA K20X GPGPUs per node, was designed for computationally intensive, GPGPU-accelerated parallel applications and data analysis tasks. Pronghorn was initially an Intel Xeon Phi coprocessor evaluation system; after the Phi coprocessors were decommissioned, it was repurposed to augment the Caldera system, but without GPGPU accelerators. For more information, see the Data Analysis and Visualization section of this report.

Data Sharing Services

The NCAR Data Sharing Service continued to provide researchers with a way to share large data sets with collaborators around the world through a simple web-based interface. Based on Globus Plus software (a tool that emerged from a partnership with the University of Chicago and Argonne National Laboratory), the NCAR Data Sharing Service provides access to the GLADE file systems, data-transport servers, and high-speed network connectivity to external research networks.
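Because the service is built on Globus, such transfers can also be scripted. The sketch below uses the Globus Python SDK (globus-sdk); the endpoint IDs, file paths, and pre-obtained access token are hypothetical placeholders and do not correspond to the actual NCAR Data Sharing Service endpoints.

    import globus_sdk

    ACCESS_TOKEN = "..."                 # obtained separately through Globus Auth
    SRC_ENDPOINT = "hypothetical-glade-sharing-endpoint-id"
    DST_ENDPOINT = "hypothetical-collaborator-endpoint-id"

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(ACCESS_TOKEN)
    )

    # Describe a single-file transfer from the shared GLADE space to a remote site.
    tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT,
                                    label="shared dataset example")
    tdata.add_item("/glade/p/example_project/shared/output.nc",   # hypothetical source
                   "/home/collaborator/output.nc")                # hypothetical destination

    task = tc.submit_transfer(tdata)
    print("Submitted transfer task:", task["task_id"])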

CISL’s disk- and tape-based HPSS archival storage systems provide an efficient, safe, and reliable environment for long-term offline hosting of datasets while offering user-friendly interfaces for quickly retrieving stored data. CISL has streamlined and improved its data services through the data-centric design of the NWSC environment, particularly via the GLADE file systems.
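For example, a retrieval through HPSS's HSI interface can be scripted as in the minimal sketch below. The archive path is a hypothetical example, and the exact hsi invocation may differ by site configuration.

    import subprocess

    archive_path = "/home/username/CESM/case01/restart.tar"   # hypothetical HPSS path

    # "hsi get" copies the named HPSS file into the current working directory.
    subprocess.run(["hsi", "get", archive_path], check=True)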

Antarctic Mesoscale Prediction System Support (AMPS)

CISL continued to operate the 84-node supercomputing cluster named Erebus, which is dedicated exclusively to AMPS, an experimental, real-time numerical weather prediction system that produces twice-daily weather predictions for the Antarctic continent. Erebus delivered 8.6 million core-hours during FY2016, providing forecasts in support of the U.S. Antarctic Program’s flight operations and polar observatory, and supporting research and education activities involving Antarctic meteorology.

CISL included an option in the NWSC-2 procurement for a new stand-alone system to replace Erebus but, upon evaluating the proposals, found it more economical to use the incremental AMPS funding to augment Cheyenne and dedicate the additional nodes to AMPS. During FY2017, CISL will assist AMPS in porting the forecast system to Cheyenne; once that is done, and after a series of operational forecasts has been carried out in parallel with those on Erebus to certify Cheyenne’s results, the Erebus system will be decommissioned and repurposed as part of CISL’s High Performance Futures Laboratory.

Data archival services

The High Performance Storage System (HPSS) is an advanced, highly scalable, and flexible mass storage and archival resource that supports both NCAR’s supercomputing environment and divisional servers run by other NCAR laboratories and UCAR programs. New HPSS hardware was installed at the NWSC to support the new supercomputing environment there, and the HPSS tape libraries in Boulder are now used to store critical data specified in the CISL Business Continuity plan. This disaster-recovery service currently supports the RDA, NCAR’s EOL, and UCAR’s COSMIC program. HPSS data holdings at the NWSC stand at around 66 PB and 241 million files, with growth averaging around 1.25 PB per month since Yellowstone began production. After the recent acquisition of two additional tape libraries at the NWSC, the HPSS system has doubled its maximum data capacity from 160 to 320 petabytes. A major HPSS upgrade is underway and is scheduled for completion in the second quarter of FY2017. For more information, see the Data archive section of this report.

HPC Futures Laboratory

CISL continued operating and enhancing its HPC Futures Lab, which focuses on HPC research relevant to improving NCAR’s HPC environment and helps CISL assess new technologies that may be available in future production systems. The HPC Futures Lab provides system administrators, consultants, and scientists with a ready-to-use environment where cutting-edge technology can be deployed and tested. Current research examines areas such as heterogeneous architectures, GPGPUs, coprocessors, resource managers, job schedulers, Message Passing Interface (MPI) software, benchmarks, performance tuning, file systems, and a variety of computation-intensive applications. For more information, see the High Performance Futures Lab section of this report.
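As an illustration of the kind of small-scale experiment such a testbed supports, the sketch below is a minimal MPI ping-pong bandwidth probe written with the mpi4py library. It is illustrative only and is not one of CISL's benchmarks; the message size and repetition count are arbitrary choices.

    # Minimal MPI ping-pong bandwidth probe. Run with two ranks, e.g.:
    #   mpiexec -n 2 python pingpong.py
    from mpi4py import MPI
    import numpy as np
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    nbytes = 8 * 1024 * 1024            # 8 MiB message
    buf = np.zeros(nbytes, dtype=np.uint8)
    reps = 50

    comm.Barrier()
    start = time.perf_counter()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        # Each repetition moves the message out and back: 2 * nbytes per rep.
        gbps = 2 * nbytes * reps / elapsed / 1e9
        print(f"Approximate point-to-point bandwidth: {gbps:.2f} GB/s")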

System specifications

The following table provides the technical details for the supercomputing systems maintained and installed by CISL during FY2016.

NWSC HPC resources
NWSC high-performance computing resources and their key attributes.


System reliability and usage

During FY2016, Yellowstone was highly available and heavily utilized, with an average scheduled availability of 99.7% and user utilization of 95.2%, while the Caldera and Geyser Data Analysis and Visualization systems averaged 98.9% scheduled availability and 29.8% user utilization. The AMPS system, Erebus, was down only for scheduled NWSC work during infrastructure preparation for the new Cheyenne supercomputer, and it averaged 76.0% utilization. The GLADE system was similarly highly available. The following table provides key reliability and usage metrics for the NWSC resources during FY2016.
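The sketch below shows one common way availability and utilization percentages of this kind are computed; exact accounting conventions vary, and the inputs are hypothetical rather than FY2016 accounting data.

    def availability_pct(up_hours, scheduled_hours):
        """Share of the scheduled service period during which the system was up."""
        return 100.0 * up_hours / scheduled_hours

    def utilization_pct(delivered_core_hours, cores, up_hours):
        """Share of the available core-hours consumed by user jobs."""
        return 100.0 * delivered_core_hours / (cores * up_hours)

    # Hypothetical month on a 72,576-core system: 718 of 720 scheduled hours up,
    # with 49.5 million core-hours delivered to user jobs.
    print(f"{availability_pct(718, 720):.1f}% available")           # 99.7%
    print(f"{utilization_pct(49.5e6, 72_576, 718):.1f}% utilized")  # 95.0%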

System availability and utilization
Average system availability and utilization during FY2016, along with some key usage and job metrics.


Stability and utilization of Yellowstone and the DAV resources reached new peaks this year. The figure below shows the availability and utilization profiles for FY2016, with annotations for days when availability was less than 90%. Several of the incidents causing downtime during FY2016 were scheduled in advance for software, networking, or facility upgrades in preparation for the new NWSC-2 systems: Cheyenne and the GLADE-2 storage resources.

System availability and utilization
FY2016 availability and utilization profiles for the HPC (Yellowstone) and DAV (Caldera and Geyser) resources. Significant downtimes (daily availability of less than 90%) are annotated, and several were due to power outages and scheduled maintenance activities.


Funding

The NWSC environment, including HPC, GLADE, and DAV resources, was made possible through NSF Core funds, with supplemental support from the University of Wyoming. AMPS computing was supported by NSF Special funding.