Lead the initiative for HPC cloud computing and related technologies

Cloud computing is generally ahead of the curve in adopting many of today’s popular and emerging technologies, yet it is still in its infancy for High Performance Computing (HPC). It therefore needs partnership and contributions from organizations like CISL to foster the mindset and architectural changes that HPC applications require. CISL is working to expand our user communities’ reach by exposing them to both public and private cloud computing environments and technologies, to gather feedback for future improvements, and to help define cloud computing science, especially for HPC use.

CISL’s cloud computing initiative started with cloud technology and services briefings by major cloud computing providers such as Google, Amazon, Microsoft, IBM, VMware, Penguin, and Nimbix; discussions with collaborators; and analyses of the technologies, solutions, and services offered by the industry. Access to both on-premises and public cloud computing resources has the potential to let our user communities go beyond their limited experience with traditional on-premises systems and work in a more customizable, elastic environment where they can test new cloud computing architectures and technologies. CISL began working with cloud computing so that it can provide its users with an on-premises cloud infrastructure with bare-metal reconfiguration privileges, while exposing them to the latest hardware and software technologies available on public cloud computing systems.

By enabling and facilitating atmospheric and related science workflows in a cloud computing environment, we may be able to encourage new communities of researchers who do not have access to large systems and whom our current traditional on-premises systems do not serve. Cloud technology is also updated more frequently, and cloud vendors can often provide access to new technology faster than CISL can deploy it in its HPFL.

CISL’s High-performance Services Section (HSS) is mitigating issues with respect to high-performance computing in the cloud. We are actively exploring public and commercial cloud offerings, private on-premises cloud solutions, and hybrid solutions.

HSS evaluated and implemented a customized Amazon Web Services infrastructure to provide an elastic and customizable training environment for workshops and seminars. HSS also evaluated the commercial offerings from Google, Amazon, Azure, and Penguin so that we can make useful recommendations on NCAR’s usage of commercial cloud technologies and on their applicability.

Analyses of this work led HSS to focus on leveraging dataset distribution, off-site storage, and the ability to use cloud-on-demand when production resources are not available for production workloads. The AMPS forecast was successfully run on cloud provider instances with workflows from data ingress to computation, analysis, and data egress. Using cloud providers increases AMPS forecast availability by providing secondary and tertiary redundancy.

NCAR created strategic use cases to evaluate cloud vendors’ offerings for a representative sample of NCAR’s HPC models and data analysis suites. The focus of the trial thus far has been on NCAR’s MPI-based HPC models, specifically the WRF weather forecasting model and the CESM global climate model. The full AMPS WRF forecast and data workflow were also run as part of this trial period. In addition to CESM and WRF, CISL is running the CISL High Performance Computing Benchmark suite, which is used to evaluate supercomputers proposed and accepted for use at NCAR. The suite measures supercomputer interconnect, processor, and memory performance from the perspective of NCAR applications.

The scale and interconnect performance of systems available from cloud providers do not match those of world-class supercomputers in the Top500 list. A scaling comparison was made between CISL’s 72-node SGI ICE XA test system, Laramie, and the highest-performing offerings from cloud providers. Jobs were run on Penguin’s POD and on an AWS MPI cluster that was built and tested using CloudFormation tools. CISL also collaborated with Rescale to create a self-contained application (a subset of the CISL High Performance Computing Benchmark) to run on cloud providers (Azure, AWS, GCP) via Rescale’s portal. CISL is in the process of capturing the performance, user experience, ease of use, administrative experience, data workflow, cost, stability, and performance variance of these offerings so they can be compared with NCAR systems. Additionally, CISL is currently engaged with both Microsoft and Google to set up an agreement between UCAR and their respective cloud businesses (Azure and GCP) via resellers to facilitate initial exposure to their offerings.
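
As an illustration of how a scaling comparison and a performance-variance measurement of this kind can be summarized, the following is a minimal Python sketch. It is not CISL's actual tooling: the function names are hypothetical, and the timings are placeholder values chosen only to exercise the code, not measured results from Laramie or any cloud provider. It assumes a strong-scaling test (fixed problem size, increasing node count) and uses the coefficient of variation of repeated runs as one possible measure of performance variance.

```python
from statistics import mean, stdev


def strong_scaling(timings):
    """Summarize a strong-scaling test.

    timings maps node count -> wall-clock seconds for the same problem size.
    Returns (nodes, speedup, efficiency) tuples, relative to the smallest
    node count in the input.
    """
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    rows = []
    for nodes in sorted(timings):
        speedup = base_time / timings[nodes]
        # Ideal speedup equals the growth in node count; efficiency is the
        # fraction of that ideal actually achieved.
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


def coefficient_of_variation(samples):
    """Run-to-run variability: sample standard deviation over the mean."""
    return stdev(samples) / mean(samples)


if __name__ == "__main__":
    # Placeholder timings (seconds), hypothetical and for illustration only.
    example_timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
    for nodes, speedup, eff in strong_scaling(example_timings):
        print(f"{nodes:>3} nodes  speedup {speedup:5.2f}  efficiency {eff:5.2f}")

    # Placeholder repeated-run timings for a single configuration.
    repeated_runs = [16.0, 16.4, 15.8, 17.1]
    print(f"coefficient of variation: {coefficient_of_variation(repeated_runs):.3f}")
```

Computing the same two summaries for each platform would give a like-for-like view of how efficiency falls off with node count and how repeatable the timings are, which are the quantities where interconnect differences between platforms tend to show up.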

HSS has designed and is creating a private NCAR Cloud with targeted availability in the first quarter of FY2018. This environment will initially be staged with current user-facing cloud technologies and will give NCAR’s user community a place to evaluate their workloads in a cloud environment. HSS has also designed and developed an in-house cloud prototype that leverages Docker as an infrastructure to provide HPC management and bare-metal provisioning for our DAV and private cloud environment. Because a lightweight container was required for shared resources on Cheyenne, HSS developed and published to the open source community a cloud solution named Inception, a lightweight “container” runtime primarily targeting HPC environments. More detail appears in the Lead national CI engagements section of this annual report.

CISL cloud computing research and testing activities have been completed using free credits from Amazon, Penguin, and other cloud providers. CISL’s on-premises cloud computing environment is being built from decommissioned hardware that was purchased with Core funding from the NSF.