High Performance Computing for Operational Modeling

Background

RAL's research and development in advanced, small-footprint computing focuses on scalable solutions for high-resolution numerical modeling with demanding data storage requirements. By keeping abreast of the increasing speed and density of rack-mounted cluster computing, RAL delivers climate analyses and real-time weather predictions from systems that occupy an ever-decreasing footprint. As hardware vendors continue to provide higher-density computing, RAL is able to move toward greener computing with lower power and cooling requirements.

An important feature of RAL/NSAP's computing design is its ability to provide solutions across scales. Given the need to deploy systems that range in size from 32 to 832 cores, running applications that range from global climatology to large-eddy simulations, the flexibility and extensibility of the computing architecture become critical to success. The cumulative number of computing cores now exceeds 3,000, spread across more than 475 nodes in use by NSAP projects.

Computing

To make effective use of core-dense compute resources (nodes) for both parallel codes (e.g., WRF) and serial post-processing, RAL has been evaluating various software layers to improve performance across differing job sizes. During FY16, this testing, analysis, and impact measurement covered Intel compilers, different versions of OpenMPI, Linux kernel power-management features, and combinations of InfiniBand OFED software stacks with FDR IB hardware. These evaluations help technical staff quantify the efficiencies gained when sizing hardware architectures for varied job types and runtime requirements, and assess how remotely hosted HPC centers with shared computing resources might benefit NSAP projects in the future.
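As an illustration, the sketch below shows how such stack comparisons can be scripted: it times a parallel executable under different MPI installations and core counts and records the wall-clock results. The mpirun paths, core counts, and the wrf.exe binary are hypothetical placeholders, not the actual RAL configuration.

    #!/usr/bin/env python3
    """Minimal sketch of a benchmark harness for comparing software stacks.

    Paths, versions, and the wrf.exe executable are illustrative assumptions.
    """
    import csv
    import subprocess
    import time

    # Hypothetical combinations of MPI stack and core count to compare.
    CONFIGS = [
        {"mpirun": "/opt/openmpi-1.10/bin/mpirun", "cores": 32},
        {"mpirun": "/opt/openmpi-2.0/bin/mpirun",  "cores": 32},
        {"mpirun": "/opt/openmpi-2.0/bin/mpirun",  "cores": 64},
    ]

    def run_case(mpirun, cores, exe="./wrf.exe"):
        """Launch one parallel run and return its wall-clock time in seconds."""
        start = time.perf_counter()
        subprocess.run([mpirun, "-np", str(cores), exe], check=True)
        return time.perf_counter() - start

    if __name__ == "__main__":
        with open("stack_timings.csv", "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["mpirun", "cores", "wallclock_s"])
            for cfg in CONFIGS:
                elapsed = run_case(cfg["mpirun"], cfg["cores"])
                writer.writerow([cfg["mpirun"], cfg["cores"], round(elapsed, 1)])

Repeating the same matrix of runs against different compiler or OFED builds of the executable yields a simple table of wall-clock times from which sizing decisions can be drawn.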

Storage

RAL continues to improve data storage management through the deployment of NAS (network-attached storage) systems that are simultaneously accessible by a variety of project clusters.  The transition from RAID disks directly attached to a single computing cluster to NAS data repositories accessible across the local area network has brought greater reliability, increased data accessibility, and less time spent on storage maintenance by system administrators and users.  In addition to the increased reliability, the NAS solution provides a growth path that allows incremental additions to data storage while maintaining a consistent logical namespace.  As a result, data users no longer have to spend time juggling datasets across individual disks, leaving the NAS architecture to manage the mapping between logical and physical space.

Parallel I/O access methods further improve the scalability of application-to-storage performance, with each server accessing the file systems over a dedicated gigabit to 10-gigabit network.  The parallel NFS (pNFS) standard minimizes hot-spot contention for datasets and provides a topology in which high-demand I/O requests are balanced across dozens of disk spindles and network ports, sustaining streaming data rates in both read and write modes.
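The sketch below illustrates the balanced-access idea using MPI-IO through mpi4py: each rank writes a non-overlapping slice of a shared file with a collective call, so no single server or spindle becomes a hot spot. The file path and record size are assumptions for illustration only, not RAL's actual layout.

    """Minimal sketch of parallel, non-overlapping writes with MPI-IO (mpi4py).

    Run with, e.g.:  mpirun -np 8 python parallel_write.py
    """
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    RECORDS_PER_RANK = 1_000_000          # hypothetical record count per rank
    data = np.full(RECORDS_PER_RANK, rank, dtype=np.float64)

    # Each rank writes its own contiguous, non-overlapping slice of the file,
    # so requests spread across the storage servers instead of piling onto one.
    fh = MPI.File.Open(comm, "/pnfs/project/output.bin",
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    offset = rank * data.nbytes
    fh.Write_at_all(offset, data)         # collective, balanced write
    fh.Close()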

At the end of FY16, NSAP had approximately 228 TB of highly fault-tolerant, parallel, network-attached, object-based storage, with the capability to grow by hundreds of terabytes in FY17 and beyond.  Cumulative storage across the various project clusters now exceeds 550 TB.  Commensurate networking upgrades have also raised potential data throughput to backend storage to 60 Gbps, providing bandwidth for the increasing demands of higher-resolution weather and climate forecasting.

Monitoring

Alongside the move toward smaller and more efficient computing and storage resources, RAL continues to expand its use of network-enabled system monitoring and performance-analysis tools at the data management layer.  These tools give technical staff real-time alerts and a graphical historical record of metrics that helps diagnose both system and application scalability.  Their extensible, community-supported plug-in architecture allows developers to adapt existing monitoring examples to varied applications across different computing architectures without writing code from the ground up.
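As an example of how little code such a plug-in requires, the following Nagios-style check is a minimal sketch (the specific framework, mount point, and thresholds are assumptions): it reports filesystem usage and returns the conventional exit codes that community monitoring frameworks use to schedule checks and raise real-time alerts.

    #!/usr/bin/env python3
    """Minimal sketch of a Nagios-style check plug-in for filesystem usage.

    The monitored path and thresholds are illustrative assumptions.
    Exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
    """
    import shutil
    import sys

    MOUNT = "/data/nas"       # hypothetical NAS mount point
    WARN, CRIT = 80.0, 90.0   # percent-used thresholds (assumed values)

    def main():
        usage = shutil.disk_usage(MOUNT)
        pct = 100.0 * usage.used / usage.total
        message = f"{MOUNT} is {pct:.1f}% full"
        if pct >= CRIT:
            print(f"CRITICAL - {message}")
            return 2
        if pct >= WARN:
            print(f"WARNING - {message}")
            return 1
        print(f"OK - {message}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())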

Hosted HPC and Cloud Computing

RAL continues to review and leverage access to remotely hosted HPC services to test software models on emerging hardware such as GPUs, newer CPU and networking technologies, and cloud platforms, with the goals of gaining efficiencies and reducing potential costs to sponsors and developing projects.  RAL is also exploring, from a design, workflow, and cost-model perspective, hybrid computing architectures in which "right-sized" compute resources execute job tasks in tandem by combining optimized on-premise hardware with dynamic, scalable cloud-based compute services.  Leveraging cloud technology to produce more efficient and higher-quality scientific results will depend on careful consideration of factors such as data locality, software model optimization, and workflow tool design.
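The toy sketch below captures one element of that design question: a dispatch rule that keeps a job on-premise when local capacity suffices and otherwise estimates the cost of bursting to the cloud. The core counts, cost rate, and example jobs are invented for illustration; a production workflow tool would also weigh data locality and transfer time before sending work off-site.

    """Toy sketch of a hybrid on-premise/cloud dispatch decision.

    Capacity, cost rate, and job parameters are illustrative assumptions.
    """
    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        cores: int
        hours: float

    ONPREM_FREE_CORES = 256            # assumed idle capacity on local clusters
    CLOUD_RATE_PER_CORE_HOUR = 0.05    # assumed cloud cost, USD per core-hour

    def place(job: Job) -> str:
        """Return 'on-premise' when local capacity suffices, else a cloud estimate."""
        if job.cores <= ONPREM_FREE_CORES:
            return "on-premise"
        est_cost = job.cores * job.hours * CLOUD_RATE_PER_CORE_HOUR
        return f"cloud (est. ${est_cost:.2f})"

    if __name__ == "__main__":
        for job in [Job("wrf-nest", 128, 6.0), Job("ensemble", 832, 3.0)]:
            print(job.name, "->", place(job))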