Apply machine learning and statistical methods to HPC-scale problems

As mentioned in the introduction, CISL is interested both in understanding how machine learning (ML) and artificial intelligence will influence the future of HPC applications and architectures, and in applying ML to optimize the operation of HPC cyberinfrastructure itself. Both topics are covered in the subsections that follow.

Model emulation

Machine learning emulation of computationally intensive numerical simulation components offers an opportunity to accelerate our simulation codes in an era of slowing hardware performance gains. CISL staff have collaborated with scientists across NCAR on three pilot projects to demonstrate the feasibility of machine learning emulators for representative use cases in climate, weather, and geospace models, and in FY2019 they made significant progress in developing accurate emulators and testing them within the target simulations. The three projects are described below.

Microphysics pilot project

CISL partnered with the Climate and Global Dynamics Laboratory to develop an emulator of rain-formation processes for the Community Earth System Model (CESM) bulk microphysics parameterization, based on a computationally intensive spectral bin microphysics scheme. After performing a two-year training-data generation run of the Community Atmosphere Model (CAM) with the bin scheme, we trained a set of neural networks to predict the output of the scheme.

After validating that the trained neural network closely reproduced the bin scheme, we developed a Fortran neural network inference engine and incorporated it into CESM. We successfully ran the emulator within CESM for nine simulated years in current, future, and preindustrial climates. A comparison between the emulated and bin-scheme results is shown in Figure 1.
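As a rough illustration of the approach (not the actual CESM code, which evaluates the trained weights through the Fortran inference engine at runtime), the sketch below trains a small feed-forward network on stand-ins for bin-scheme input/output pairs; all variable names, shapes, and hyperparameters are placeholders, not the project's configuration.

    # Minimal sketch of training a feed-forward emulator on bin-scheme
    # input/output pairs. Data shapes and names are illustrative only.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)

    # Stand-ins for saved CAM state variables (e.g., temperature, humidity,
    # cloud water) and the bin scheme's rain-formation tendencies.
    X = rng.normal(size=(10000, 8))      # per-grid-cell model state
    y = rng.normal(size=(10000, 2))      # bin-scheme process rates

    x_scaler, y_scaler = StandardScaler().fit(X), StandardScaler().fit(y)

    emulator = MLPRegressor(hidden_layer_sizes=(64, 64),
                            activation="relu", max_iter=500)
    emulator.fit(x_scaler.transform(X), y_scaler.transform(y))

    # Check offline skill before any in-model testing.
    pred = y_scaler.inverse_transform(
        emulator.predict(x_scaler.transform(X)))
    print("RMSE per output:", np.sqrt(((pred - y) ** 2).mean(axis=0)))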

Figure 1: Comparison of annual mean cloud liquid water path (vertically integrated cloud liquid water) between Community Atmosphere Model runs with the neural network emulator (top) and the TAU bin microphysics scheme (bottom).

Surface layer pilot project

CISL and the Research Applications Laboratory (RAL) developed a new machine learning surface layer parameterization from long records of observations and tested it within the Weather Research and Forecasting (WRF) model. We trained random forests and neural networks to estimate the components of momentum, sensible heat, and latent heat fluxes at the surface from near-surface measurements of wind, temperature, and moisture at sites in the Netherlands and Idaho. In offline tests, the machine learning models outperformed current approaches based on Monin-Obukhov similarity theory. We tested the machine learning parameterizations in idealized single-column WRF runs (Figure 2) to identify and correct for the effects of model feedbacks.
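A minimal sketch of the offline step, using scikit-learn's random forest on stand-ins for the tower observations; the feature set and the choice of friction velocity (a momentum-flux component) as the target are illustrative assumptions, not the project's exact configuration.

    # Illustrative sketch: fit a random forest to predict a surface flux
    # component from near-surface measurements. Data are synthetic.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    n = 5000

    # Stand-ins for wind speed, air temperature, surface temperature,
    # and specific humidity observations at one site.
    features = rng.normal(size=(n, 4))
    friction_velocity = 0.1 * np.abs(features[:, 0]) \
        + rng.normal(scale=0.01, size=n)

    X_train, X_test, y_train, y_test = train_test_split(
        features, friction_velocity, test_size=0.25, random_state=0)

    rf = RandomForestRegressor(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)
    print("R^2 on held-out data:", rf.score(X_test, y_test))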

Figure 2: Time-height profile of temperature (top) and mixing ratio (bottom) in the boundary layer of a single-column WRF run using the random forest surface layer parameterization.

Space weather pilot project

In partnership with the High Altitude Observatory, we developed deep convolutional neural networks (CNNs) to predict the strength of geomagnetic storms caused by interplanetary coronal mass ejections (ICMEs) from optical observations of the erupting CME in the solar corona. As shown in Figure 3, each CNN input sample consists of three images corresponding to the intensity of the linearly polarized light, the angle of the linearly polarized light, and the amount of circularly polarized light. Although the images are grayscale, treating them as three color channels of a single image produced excellent results, since the three measurements are a close analog of color channels.

The results showed that circularly polarized light is strongly predictive of the strength, or magnitude, of a geomagnetic storm, while linearly polarized light is only weakly predictive. Beyond prediction, an aim of this study was to demonstrate the importance of circularly polarized light, which few telescopes capture because of the large lens required.
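The sketch below shows the three-channel idea in a small Keras CNN: the three polarization images are stacked as the channels of one input, exactly as RGB channels would be. The image size, network depth, and scalar storm-strength target are assumptions for illustration.

    # Sketch of a small CNN treating the three polarization images as the
    # channels of a single input. Architecture details are illustrative.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(128, 128, 3)),  # 3 polarization "channels"
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1)  # predicted storm-strength index
    ])
    model.compile(optimizer="adam", loss="mse")
    model.summary()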

Figure 3: An example of one input sample for the CNN. Left: Intensity of the linearly polarized light. Center: Angle of the linearly polarized light. Right: Amount of circularly polarized light. There are 36,288 input samples, for a total of 108,864 images.

Machine learning of severe weather impacts

Hailstorm analysis and prediction

We used deep learning to identify storm structures associated with large hail. From a large archive of NCAR 3-km-grid WRF output, we trained convolutional neural networks (CNNs) to predict the probability of simulated severe hail, and the CNNs outperformed traditional machine learning methods on this task. We also applied deep learning interpretation techniques to identify which storm features maximize the probability of severe hail; the features the neural networks rely on are consistent with those identified by hail experts.
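As one example of the kind of interpretation technique involved (the report does not specify which methods were applied), the sketch below computes a simple gradient saliency map for a trained Keras CNN; the `model` and `storm` inputs are hypothetical placeholders.

    # Hedged sketch of gradient-based saliency, one common CNN
    # interpretation technique. Assumes a trained Keras model and one
    # storm-centered input patch of WRF fields.
    import numpy as np
    import tensorflow as tf

    def saliency_map(model, storm):
        """Gradient of the predicted hail probability w.r.t. input pixels."""
        x = tf.convert_to_tensor(storm[np.newaxis, ...], dtype=tf.float32)
        with tf.GradientTape() as tape:
            tape.watch(x)
            prob = model(x)[:, 0]
        grad = tape.gradient(prob, x)
        # Large absolute gradients mark the pixels the CNN relies on most;
        # reduce over the input fields (channels) for a 2-D heat map.
        return np.abs(grad.numpy()[0]).max(axis=-1)

    # Usage with any trained Keras CNN, e.g. (names are hypothetical):
    # heatmap = saliency_map(hail_cnn, storm_patch)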

In addition to analysis, we collaborated with scientists at the University of Oklahoma to further refine a machine learning hail prediction system and run it in real time in the spring and summer of 2019. The results were published in Monthly Weather Review and publicized in multiple media outlets, including The Washington Post and Channels 7 and 9 in Denver. 

Hurricane intensification

In collaboration with RAL scientists, the National Hurricane Center, and MIT, we are developing a deep learning model to predict the rapid intensification of hurricanes. To train and evaluate the model, we are using a three-year (2015-2017) archive of reforecasts from the operational Hurricane WRF (HWRF) model along with real-time model runs from the 2018 season. We have trained models on HWRF output fields to predict both a storm's intensity and how much that intensity will change. Development is ongoing, with real-time testing planned for the 2020 hurricane season.

Machine learning applied to HPC systems

Exploring the use of machine learning to predict file reuse 

The ability to predict which files are likely to be reused in a multi-tier storage system could help reduce storage costs while still meeting overall performance needs. Files that are more likely to be reused can be kept on faster, more expensive storage, while files that are less likely to be reused can be kept on slower, less expensive storage; a file that will not be accessed for more than a year, for example, need not occupy a high-end storage system. Information about file reuse could thus also be used to help size the storage tiers.

CISL began studying file reuse patterns in our parallel file systems to explore the possible benefits of using machine learning to predict file reuse. We implemented scripts that collect file reuse data on a continuing basis and have begun the initial exploration of those data. Next steps are to continue that exploration, create statistical and visual summaries of the reuse patterns, and begin determining whether ML algorithms can help predict reuse.
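Once the reuse data have been summarized, a first ML experiment might look like the sketch below, which trains a classifier to predict reuse within a time window from simple file-metadata features; the features and labeling rule here are invented for illustration and do not reflect the attributes we actually collect.

    # Exploratory sketch: predict near-term file reuse from metadata.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(2)
    n = 20000

    # Stand-ins for: file size (bytes), days since last access,
    # number of past accesses, owner's recent activity level.
    X = np.column_stack([
        rng.lognormal(15, 2, n),
        rng.exponential(90, n),
        rng.poisson(3, n),
        rng.random(n),
    ])
    reused = (X[:, 1] < 30) & (X[:, 2] > 1)  # toy labeling rule

    clf = GradientBoostingClassifier()
    print("CV accuracy:", cross_val_score(clf, X, reused, cv=5).mean())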

Using k-means clustering to detect anomalous file removals

CISL previously developed a k-means clustering algorithm that detects anomalous file removals on our archival storage system in order to enhance our cybersecurity. In FY2019, we deployed the algorithm and associated software scripts into full production; alerts are now triggered whenever suspicious file removal activity is detected on the system.
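A minimal sketch of the underlying idea, assuming illustrative activity features such as removals per hour, bytes removed, and distinct directories touched (the production feature set is not described here): fit k-means on historical remove activity and alert when new activity is far from every learned cluster.

    # Sketch of k-means anomaly detection on file-removal activity.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    normal_activity = rng.normal(size=(1000, 3))   # historical baseline

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(normal_activity)
    # Distance to the nearest centroid; flag the top 1% as suspicious.
    threshold = np.quantile(km.transform(normal_activity).min(axis=1), 0.99)

    def is_anomalous(sample):
        """Alert when an activity vector is far from all learned clusters."""
        dist = km.transform(np.atleast_2d(sample)).min(axis=1)
        return dist > threshold

    print(is_anomalous([8.0, 8.0, 8.0]))  # far from baseline -> [ True]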

Scheduler optimization

CISL is exploring scheduler optimizations that use machine learning analysis of job usage information to predict job runtimes more accurately and schedule jobs more efficiently. Using a scheduler simulator coupled with a soft wall-time concept, we have demonstrated that this approach has the potential to extract more productive node-hours from the system. Work is under way to improve the efficiency of the scheduler optimization for use in a production environment.

The runtimes that users request for the jobs they submit play a crucial role in the efficiency of HPC systems: underestimating runtime can cause jobs to be terminated before completion, while overestimation can leave other jobs queued for too long. Predicting runtime in HPC is challenging due to the complexity and dynamics of running workloads, and most current predictive runtime models are trained on static workloads, which risks biasing the predictions toward the learned workload distribution. CISL proposed adapting the correlation alignment (CORAL) method within our deep neural network architecture to alleviate the domain shift between workloads and thereby improve runtime predictions. Experiments on benchmark workloads and on real-time NCAR production workloads show that the proposed method trains more stably across different workloads, with lower accuracy variance than other state-of-the-art methods. A poster for this work was accepted for display at SC19.
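For reference, correlation alignment penalizes the difference between the second-order statistics (feature covariances) of the source and target domains (Sun & Saenko, 2016). The NumPy sketch below shows the loss itself; in a deep architecture it would be computed on network activations and added to the runtime-prediction loss, and our exact network design is not reproduced here.

    # Sketch of the correlation alignment (CORAL) loss for domain
    # adaptation between workloads.
    import numpy as np

    def coral_loss(source_feats, target_feats):
        """||C_s - C_t||_F^2 / (4 d^2) for (n, d) feature matrices."""
        d = source_feats.shape[1]
        c_s = np.cov(source_feats, rowvar=False)
        c_t = np.cov(target_feats, rowvar=False)
        return np.sum((c_s - c_t) ** 2) / (4 * d * d)

    # Example: features from a benchmark workload vs. a shifted
    # production workload.
    rng = np.random.default_rng(4)
    print(coral_loss(rng.normal(size=(256, 32)),
                     rng.normal(scale=1.5, size=(256, 32))))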

Ensemble consistency tests and compression

Lossy data compression

Significant increases in computational power in recent years have enabled the Community Earth System Model to run at finer resolutions and higher throughput, producing increasingly large data volumes that strain institutional storage resources. These storage limitations negatively impact science objectives by forcing scientists to run fewer or shorter simulations and/or to output data less frequently. Lossy data compression offers a potential means of mitigating the data volume problem, but by definition it cannot exactly reproduce the original values.

Striking a balance between meaningfully reducing data volume and preserving the integrity of the simulation data is a critical and non-trivial task. CISL is focused on gaining user acceptance of lossy compression via careful analysis and testing. As part of our ongoing effort to develop an appropriate suite of measures for detecting compression-induced artifacts in varied analysis settings, we recently evaluated a number of image-quality metrics, since data visualization is particularly important to climate scientists interacting with model output. We conducted a forced-choice visual evaluation study with climate model data and used statistical models to suggest the most appropriate measures and thresholds for judging whether images generated from the compressed data are noticeably different from images based on the original data.
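As a concrete example of one widely used image-quality metric of the sort evaluated (the measures and thresholds the study actually selected are not reproduced here), the sketch below computes the structural similarity index (SSIM) between an original field and its lossy reconstruction; the acceptance threshold shown is a placeholder.

    # Sketch of an image-quality check on original vs. compressed data.
    import numpy as np
    from skimage.metrics import structural_similarity

    rng = np.random.default_rng(5)
    original = rng.random((256, 256))
    compressed = original + rng.normal(scale=0.01, size=original.shape)

    score = structural_similarity(
        original, compressed,
        data_range=compressed.max() - compressed.min())
    print(f"SSIM = {score:.4f}")   # 1.0 means visually identical
    if score < 0.95:               # hypothetical acceptance threshold
        print("Images may be noticeably different")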

We are also developing tools that use a variety of statistical methods at different spatiotemporal scales to highlight compression-induced features that would go unnoticed by standard data-compression metrics. Our goal is to address the issues that arise from this analysis together with compression algorithm development teams. We recently began a promising collaboration with developers of a state-of-the-art lossy compressor and are working on improvements to address issues of importance to the climate community. While more analysis is needed, we are building trust in the climate community by providing scientists with easy-to-use methods for investigating the effects of lossy compression.
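One simple example of a scale-aware diagnostic in this spirit (an illustration, not one of our production tools): block-averaging the difference field at several spatial scales exposes a localized bias that stands out at coarse scales, while unbiased noise averages away.

    # Illustrative multi-scale check on a compression difference field.
    import numpy as np

    def block_rmse(diff, block):
        """RMS of the difference field after block x block averaging."""
        h = (diff.shape[0] // block) * block
        w = (diff.shape[1] // block) * block
        coarse = diff[:h, :w].reshape(h // block, block, w // block, block)
        return np.sqrt((coarse.mean(axis=(1, 3)) ** 2).mean())

    diff = np.random.default_rng(6).normal(scale=0.01, size=(180, 360))
    diff[60:70, 100:110] += 0.5   # synthetic localized artifact
    for b in (1, 5, 15, 30):
        print(f"block={b:2d}  RMS of coarse means = {block_rmse(diff, b):.4f}")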

Ensemble consistency testing

The Community Earth System Model (CESM) features a large and complex code base that is constantly evolving to support a range of experimental configurations, to include new science features, and to take advantage of modern computing platforms. Maintaining model confidence and reliability via software quality assurance is both non-trivial and critical, as simulation output may be used to guide societal responses to the changing climate. The CESM Ensemble Consistency Test (CESM-ECT) was developed as a flexible but objective method for checking statistical consistency between an accepted ensemble of climate simulations and new simulations created with updated code or within a new computational infrastructure. CESM-ECT uses a testing framework based on principal component analysis (PCA) to determine whether a set of new simulations is statistically distinguishable from the established ensemble.
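A highly simplified sketch of the PCA idea behind such a test, assuming global-mean output variables as inputs: fit principal components on the accepted ensemble, then ask whether the new runs' PC scores fall within the ensemble's spread. The actual CESM-ECT applies a more careful pass/fail rule than this toy bound check.

    # Toy PCA-based consistency check against an accepted ensemble.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(7)
    ensemble = rng.normal(size=(150, 40))   # 150 accepted runs x 40 variables
    new_runs = rng.normal(size=(3, 40))     # runs from modified code/platform

    scaler = StandardScaler().fit(ensemble)
    pca = PCA(n_components=10).fit(scaler.transform(ensemble))

    scores = pca.transform(scaler.transform(ensemble))
    lo, hi = np.percentile(scores, [0.5, 99.5], axis=0)

    new_scores = pca.transform(scaler.transform(new_runs))
    failing = ((new_scores < lo) | (new_scores > hi)).sum(axis=1)
    print("PCs outside ensemble bounds per new run:", failing)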

While CESM-ECT has proven useful for detecting the presence of a discrepancy, systematically determining the root causes of a discrepancy has turned out to be a formidable challenge. To address it, CISL developed a strategy that reduces the search space to a tractable size so that runtime variable sampling becomes feasible. This process has shown promise in locating error sources in multiple examples of CESM simulation output. Though much remains to be done, our goal is to provide CESM developers with tools to identify and understand the reasons for statistically distinct output. We are also using CESM-ECT to help evaluate the correctness of code optimizations; for example, a recent encouraging effort involves converting double-precision computations to single precision in physics parameterizations such as the atmospheric microphysics code, which models cloud processes.