Production supercomputing status

In FY2015, CISL focused its efforts on ensuring that Yellowstone and related systems – GLADE, Caldera, Erebus, Geyser, and Pronghorn – were operated at the highest levels of performance, availability, and utilization. This section describes some major events and highlights from FY2015.

Balanced supercomputing environment

Throughout the year, CISL continued to deliver on its computing imperative for hardware and software cyberinfrastructure. CISL’s efforts remained substantially focused on providing robust, reliable, and secure high-performance computing resources in a production research environment, and on supporting that environment for thousands of users and hundreds of projects spanning NCAR, UCAR member universities, and the University of Wyoming in a wide variety of disciplines related to the Earth System sciences. CISL resources empower the research community to pursue more innovative investigations, and CISL itself provides the organizational focus, capabilities, and skill sets to support these investigations, including meteorological field campaigns and computational projects of other NCAR laboratories.

CISL’s production supercomputing environment is designed, administered, and managed to provide data-centric computational resources that are balanced to meet the needs of its numerical simulation and data-analysis communities. Further, CISL strives to provide the most effective combination of computational and data storage capability and capacity, augmented by scientific data analysis, visualization, and archival services. CISL works to provide equitable access to these computing and storage resources while achieving high reliability, minimizing job wait times, and maximizing resource throughput and utilization. Meeting these objectives requires CISL to continuously monitor system usage and performance and to balance resource allocation through priority-based job scheduling, a well-tuned job queue structure, and single-job resource limits.
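As a conceptual illustration only (not the production LSF configuration), the short Python sketch below shows how queue priority and single-job resource limits interact during scheduling; the queue names, priorities, and core limits are hypothetical.

```python
import heapq

# Hypothetical queue policy: higher priority is scheduled first, and a
# single job may not request more cores than the queue's per-job limit.
QUEUE_POLICY = {
    "premium": {"priority": 80, "max_cores_per_job": 16384},
    "regular": {"priority": 50, "max_cores_per_job": 16384},
    "economy": {"priority": 20, "max_cores_per_job": 4096},
}

def submit(pending, name, queue, cores, submit_order):
    """Enqueue a job, enforcing the queue's single-job core limit."""
    policy = QUEUE_POLICY[queue]
    if cores > policy["max_cores_per_job"]:
        raise ValueError(f"{name}: request exceeds per-job limit in '{queue}'")
    # heapq is a min-heap, so negate priority; ties fall back to submit order.
    heapq.heappush(pending, (-policy["priority"], submit_order, name, cores))

def schedule(pending, free_cores):
    """Dispatch the highest-priority jobs that fit on the free cores."""
    dispatched, deferred = [], []
    while pending:
        prio, order, name, cores = heapq.heappop(pending)
        if cores <= free_cores:
            free_cores -= cores
            dispatched.append(name)
        else:
            deferred.append((prio, order, name, cores))  # retry next pass
    for item in deferred:
        heapq.heappush(pending, item)
    return dispatched, free_cores

# Example: three jobs competing for 32,768 free cores.
pending = []
submit(pending, "wrf_ensemble", "premium", 16384, 1)
submit(pending, "cesm_control", "regular", 16384, 2)
submit(pending, "post_proc", "economy", 4096, 3)
print(schedule(pending, 32768))  # premium and regular run; economy waits
```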

System software upgrades

The NWSC software environment was very stable throughout FY2015, with a minor upgrade to the LSF resource manager and job scheduler late in the year. The only other software changes were a series of minor upgrades to the InfiniBand routing algorithm, including the implementation of “routing engine chains,” which allow Yellowstone to be routed as a full fat tree while the InfiniBand connections to the Globally Accessible Data Environment (GLADE) and the DAV resources are routed differently.
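For context, fat-tree routing is one of several unicast routing engines provided by the OpenSM subnet manager; in the simplest (non-chained) case an engine is selected in opensm.conf as sketched below. The routing-chains configuration actually deployed on Yellowstone is more involved, assigning separate engines to separate parts of the fabric, and is not reproduced here.

```
# opensm.conf excerpt (illustrative only): select the fat-tree unicast
# routing engine for the fabric. The routing-chains setup referenced
# above assigns engines per topology and is not shown here.
routing_engine ftree
```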

System specifications

The following table provides the technical details for the supercomputing systems maintained by CISL during FY2015.

| | Yellowstone | Caldera | Geyser | Pronghorn | Erebus (AMPS) |
| --- | --- | --- | --- | --- | --- |
| Peak FLOP rate (TFLOPS) | 1,509.6 | 21.8 | 14.4 | 37.7 | 28.0 |
| Total number of nodes | 4,536 | 16 | 16 | 16 | 84 |
| Primary node architecture | IBM dx360 M4 | IBM dx360 M4 | IBM x3850 X5 | IBM dx360 M4 | IBM dx360 M4 |
| CPU type | Intel Xeon E5-2670 | Intel Xeon E5-2670 | Intel Xeon E7-4870 | Intel Xeon E5-2670 | Intel Xeon E5-2670 |
| CPU microarchitecture | Sandy Bridge EP | Sandy Bridge EP | Westmere EX | Sandy Bridge EP | Sandy Bridge EP |
| CPU frequency (GHz) | 2.6 | 2.6 | 2.4 | 2.6 | 2.6 |
| CPUs per node | 2 | 2 | 4 | 2 | 2 |
| Cores per node | 16 | 16 | 40 | 16 | 16 |
| Node memory capacity (GB) | 32 | 64 | 1,024 | 64 | 32 |
| Node memory type | DDR3-1600 | DDR3-1600 | DDR3-1066 | DDR3-1600 | DDR3-1600 |
| Interconnect network | InfiniBand 4x FDR | InfiniBand 4x FDR | InfiniBand 4x FDR | InfiniBand 4x FDR | InfiniBand 4x FDR-10 |
| Interconnect topology | 3-tier full fat tree | 1-tier full fat tree | 1-tier full fat tree | 1-tier full fat tree | 2-tier full fat tree |
| Network ports per node | 1 | 1 | 2 | 1 | 1 |
| System bisection bandwidth (GB/s) | 31,100 | 109 | 104 | 109 | 407 |
| Accelerator/GPU | - | NVIDIA K20X | NVIDIA K5000 | - | - |
| Accelerator peak single-precision FLOP rate (GFLOPS) | - | 3,950 | 2,150 | - | - |
| Accelerator peak double-precision FLOP rate (GFLOPS) | - | 1,310 | 90 | - | - |
| Accelerators per node | - | 2 | 1 | - | - |
| Accelerator memory capacity (GB) | - | 6 | 4 | - | - |
| Accelerator memory type | - | GDDR5 | GDDR5 | - | - |
| Number of compute racks | 63 | 0.5 | 2 | 0.5 | 1 |
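As a sanity check on the peak figures above, the CPU peak FLOP rate follows from node count, cores per node, clock frequency, and the 8 double-precision floating-point operations per cycle per core that Sandy Bridge EP can issue with AVX. The short Python check below reproduces the table values for Yellowstone and Erebus, the two systems whose peaks come from CPUs alone.

```python
# Peak double-precision FLOP rate for the CPU-only systems in the table
# above, assuming 8 DP floating-point operations per cycle per core
# (Sandy Bridge EP with AVX: 4-wide add plus 4-wide multiply).
FLOPS_PER_CYCLE = 8

def peak_tflops(nodes, cores_per_node, ghz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Return theoretical peak in teraFLOPS."""
    return nodes * cores_per_node * ghz * flops_per_cycle / 1000.0

print(f"Yellowstone: {peak_tflops(4536, 16, 2.6):.1f} TFLOPS")  # 1509.6
print(f"Erebus:      {peak_tflops(84, 16, 2.6):.1f} TFLOPS")    # 28.0
```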

HPC Futures Laboratory

CISL continued to operate its HPC Futures Lab, which focuses on HPC research relevant both to improving the current environment and to assessing new technologies that may appear in future systems. The HPC Futures Lab provides system administration staff, consulting staff, and scientists with a ready-to-use environment where cutting-edge technology can be deployed and tested. Current research examines areas such as heterogeneous architectures, GPGPUs, coprocessors, resource managers, job schedulers, Message Passing Interface (MPI) software, benchmarks, performance tuning, file systems, and a variety of computation-intensive applications.
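As an example of the kind of MPI benchmarking such a testbed supports, the sketch below is a minimal two-rank ping-pong bandwidth test written with mpi4py; it is a generic illustration, not a benchmark used by CISL.

```python
# Minimal MPI ping-pong bandwidth microbenchmark (run with two ranks,
# e.g. `mpiexec -n 2 python pingpong.py`). Illustrative only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

NBYTES = 8 * 1024 * 1024          # 8 MiB message
REPS = 100
buf = np.zeros(NBYTES, dtype=np.uint8)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(REPS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=1)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
elapsed = MPI.Wtime() - t0

if rank == 0:
    # Each repetition moves the message in both directions.
    gb_moved = 2 * REPS * NBYTES / 1e9
    print(f"effective bandwidth: {gb_moved / elapsed:.2f} GB/s")
```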

System availability

During FY2015, Yellowstone had an average scheduled availability of 99.8% and user utilization of 94.8%, while the Data Analysis and Visualization systems (Caldera and Geyser) averaged 97.8% scheduled availability and 21.7% user utilization.

FY2015 Availability Metrics

| | GLADE | Yellowstone | DAV |
| --- | --- | --- | --- |
| Total user availability | 99.7% | 98.9% | 97.1% |
| Downtime: scheduled maintenance and environmental | 0.3% | 0.9% | 0.7% |
| Scheduled availability | 100% | 99.8% | 97.8% |
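The scheduled-availability figures above are consistent with treating scheduled maintenance and environmental downtime as excluded time, i.e., scheduled availability = total user availability / (1 - scheduled downtime fraction). A quick check in Python:

```python
# Check: scheduled availability = user availability / (1 - scheduled downtime),
# which reproduces the FY2015 figures in the table above.
systems = {
    "GLADE":       {"user_availability": 0.997, "scheduled_downtime": 0.003},
    "Yellowstone": {"user_availability": 0.989, "scheduled_downtime": 0.009},
    "DAV":         {"user_availability": 0.971, "scheduled_downtime": 0.007},
}

for name, s in systems.items():
    sched = s["user_availability"] / (1 - s["scheduled_downtime"])
    print(f"{name}: scheduled availability = {sched:.1%}")
# GLADE: 100.0%, Yellowstone: 99.8%, DAV: 97.8%
```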

Stability and utilization of Yellowstone and the DAV resources reached new peaks this year. The figure below shows the availability and utilization profiles for FY2015, annotating the days on which availability fell below 90%.

Figure: HPC availability. Availability and utilization profiles for the HPC (Yellowstone) and DAV (Caldera and Geyser) resources during FY2015, showing downtimes due to power outages and downtimes for scheduled maintenance, InfiniBand testing and routing algorithm changes, and software updates.

System optimization

Throughout FY2015, CISL’s system administration and consulting services staff continued to focus on stabilizing, refining, and optimizing the user environment and on working with end users to improve application resilience and performance.

Funding

This work was made possible through NSF Core funds, including CSL funding; NSF Special Funds were used for the AMPS resources.