Apply machine learning and statistical methods to HPC-scale problems

CISL staff embarked on multiple projects applying machine learning and statistical methods to a variety of HPC-scale problems in FY2018. Machine learning techniques use mathematical models that identify patterns in large data sets and exploit those patterns to make predictions. Investigating the application of machine learning at HPC scales is essential: it may be the key to overcoming stagnating model performance, reducing the cost of model parameter tuning, accelerating the analysis of data-intensive workflows, and enhancing the extraction of scientific knowledge from large or complex data sets. CISL is employing machine learning and statistical models to emulate computationally intensive software systems and to diagnose features in weather and climate simulations, as well as issues with the HPC hardware and software that drive those simulations. Ultimately, these projects help support and accelerate the science performed at NCAR.

Model emulation

Numerical weather and climate simulations cannot resolve every relevant physical process. These unresolved processes are approximated with sub-grid parameterization schemes, which require either large computational resources or significant simplifications of the original process. Machine learning models that emulate the behavior of parameterization schemes at a fraction of their computational cost have shown some promise but are not widely used in atmospheric modeling to date. CISL’s Analytics and Integrative Machine Learning (AIML) group was formed in FY2018 with the initial task of investigating the feasibility of emulation in three projects.

In one project, machine learning models emulated the process that converts cloud droplets to raindrops, which has a large effect on the evolution of clouds and rainfall. A climate model ran for two years with the complex microphysics parameterization scheme, and the machine learning model learned to approximate the relationship between the parameterization's inputs and outputs. In another project, machine learning models emulated the relationship between surface weather data and the flux of energy between the land and the lower atmosphere. In a third project, neural networks learned to estimate the severity of coronal mass ejections from the Sun when they reach Earth. Physical models of these ejections cannot run fast enough to estimate severity before an ejection reaches Earth, so a neural network emulator could give the government enough advance warning to protect vulnerable satellites and astronauts.
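As a minimal illustration of the emulation idea, the sketch below fits a least-squares model to a Kessler-style autoconversion formula standing in for an expensive microphysics parameterization. The formula, variable names, and data ranges are illustrative assumptions, not the schemes CISL actually emulated:

```python
import numpy as np

# Hypothetical stand-in for an expensive microphysics parameterization:
# an autoconversion rate from cloud water (qc) and droplet number (nc).
def expensive_parameterization(qc, nc):
    return 1350.0 * qc**2.47 * nc**-1.79  # illustrative power-law form

rng = np.random.default_rng(0)
qc = rng.uniform(1e-4, 1e-3, 5000)   # cloud water mixing ratio (kg/kg)
nc = rng.uniform(1e7, 1e9, 5000)     # droplet number concentration (1/m^3)
y = np.log(expensive_parameterization(qc, nc))

# Emulate in log space with ordinary least squares: the log-rate is linear
# in log(qc) and log(nc), so a linear emulator recovers it almost exactly.
X = np.column_stack([np.ones_like(qc), np.log(qc), np.log(nc)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

pred = X @ coef
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(f"fit coefficients: {coef}")
print(f"log-space RMSE: {rmse:.2e}")
```

Real parameterizations are not exactly log-linear, which is why the projects above used more flexible learners such as neural networks; the workflow, however, is the same: run the expensive scheme, collect input-output pairs, and fit a cheap surrogate.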

Machine learning of severe weather impacts

Machine learning models can also emulate the relationship between physical model output and observed severe weather impacts. Convolutional neural networks have been trained to identify spatial patterns in simulations of thunderstorms and relate them to the size of hail at the surface. CISL compared such networks with other, less complex spatial machine learning methods and found that the networks were more accurate and reliable. Neural network interpretation techniques allow visualization of the patterns learned by the neural network (Figure 1). Propagating information backward through a trained neural network to the input level reveals what features the network associates with the perfect hailstorm. A CISL network identified multiple storm structures, such as wind shear and instability, that meteorologists have associated with large hail growth. The network also learned to capture patterns across multiple fields and to keep them physically consistent with one another.

Hailstorm visualization
Figure 1. A “dream” hailstorm generated by a convolutional neural network trained to predict severe hail. The filled contours indicate geopotential height anomalies (red is positive). Red contours indicate temperature, green contours indicate dewpoint, and the arrows show wind direction. The subplots show different pressure levels, or slices from (left) higher to (right) lower levels of the storm.
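The backward propagation behind the "dream" storm can be sketched as activation maximization: gradient ascent on the input rather than on the network weights. The toy linear model, fixed pattern, and learning rate below are hypothetical stand-ins for the trained convolutional network:

```python
import numpy as np

# Toy stand-in for a trained hail classifier: the score is the projection
# of an input "storm" field onto a fixed learned pattern. The real work
# used deep convolutional networks; everything here is illustrative.
rng = np.random.default_rng(1)
learned_pattern = rng.standard_normal((8, 8))

def network_score(x):
    return float(np.sum(x * learned_pattern))

def numeric_grad(f, x, eps=1e-4):
    # Black-box finite-difference gradient, so the same recipe applies
    # to any model, not just this linear toy.
    g = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        old = x[i]
        x[i] = old + eps
        up = f(x)
        x[i] = old - eps
        dn = f(x)
        x[i] = old
        g[i] = (up - dn) / (2 * eps)
    return g

# Activation maximization: gradient ascent on the *input* to synthesize
# the storm the network finds most hail-like (the "dream" storm).
x = np.zeros((8, 8))
for _ in range(50):
    x += 0.1 * numeric_grad(network_score, x)

corr = np.corrcoef(x.ravel(), learned_pattern.ravel())[0, 1]
print(f"correlation with learned pattern: {corr:.3f}")
```

In practice the gradient comes from backpropagation through the trained network, and the optimized input is a multi-field storm image like the one in Figure 1 rather than a single 8x8 array.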

Machine learning applied to HPC systems

One of the purposes of a data archive is to preserve irreplaceable data for future studies and generations. Data can be lost from an archive in a number of ways, including accidental or malicious deletion. While software exists to check for specific known threats or problems, detecting non-specific anomalous behavior, such as unusual file removal patterns, is harder. To help protect the 87 petabytes of data in NCAR's tape-based data archive, in FY2018 CISL explored file removal patterns and implemented a k-means clustering solution that detects anomalous file removals and sends alerts. The algorithm builds a statistical model of normal behavior and then flags outliers.
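A minimal sketch of the approach, using synthetic data and a hand-rolled k-means: the feature choices (files removed and gigabytes removed per user-day) are illustrative assumptions and may differ from the production system:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical features for one user-day of archive activity:
# (number of files removed, gigabytes removed). All data is synthetic.
train = rng.normal(loc=[50.0, 5.0], scale=[10.0, 1.0], size=(500, 2))

# Fit plain k-means (Lloyd's algorithm) on a historical window that is
# assumed to reflect normal removal behavior.
k = 3
centers = train[rng.choice(len(train), size=k, replace=False)].copy()
for _ in range(50):
    d = np.linalg.norm(train[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    for j in range(k):
        if np.any(labels == j):
            centers[j] = train[labels == j].mean(axis=0)

# Score new events by distance to the nearest cluster center; the alert
# threshold is the worst distance seen in the normal window, with margin.
def score(points):
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1)

threshold = 1.5 * score(train).max()

new_events = np.vstack([
    rng.normal(loc=[50.0, 5.0], scale=[10.0, 1.0], size=(5, 2)),  # normal
    [[5000.0, 400.0]],                                            # mass delete
])
flagged = np.where(score(new_events) > threshold)[0]
print("anomalous events:", flagged)
```

Fitting only on a window assumed to be normal, then thresholding on distance to the nearest center, is one common way to turn a clustering algorithm into an anomaly detector.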

In another application, CISL's High-end Services Section (HSS) used machine learning to help automate tape technology migrations. Migrating large quantities of archived data from one tape generation to the next can take months and can interfere with regular user reads and writes. If the migration occupies too many resources, users' data transfers stall; if it uses too few, the migration drags on. HSS implemented a simple algorithm that learns from historical and current workloads and dynamically adjusts resources to manage such migrations more efficiently.
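One way such a dynamic adjustment could work is a simple feedback loop that trades migration tape drives against observed user wait times. The class, drive counts, and thresholds below are hypothetical, not the HSS implementation:

```python
from collections import deque

# Hypothetical feedback loop: choose how many tape drives to devote to a
# technology migration based on recent user wait times (numbers illustrative).
class MigrationThrottle:
    def __init__(self, total_drives=8, target_wait_s=60.0, window=20):
        self.total = total_drives
        self.target = target_wait_s
        self.waits = deque(maxlen=window)     # rolling window of user waits
        self.migration_drives = total_drives // 2

    def observe(self, user_wait_s):
        self.waits.append(user_wait_s)
        avg = sum(self.waits) / len(self.waits)
        if avg > self.target and self.migration_drives > 1:
            self.migration_drives -= 1        # users waiting too long: back off
        elif avg < 0.5 * self.target and self.migration_drives < self.total - 1:
            self.migration_drives += 1        # headroom: speed up the migration
        return self.migration_drives

t = MigrationThrottle()
for w in [10, 15, 12, 8, 11]:                         # light user load
    t.observe(w)
light = t.migration_drives
for w in [300, 280, 350, 400, 320, 500, 450, 600]:    # heavy user load
    t.observe(w)
heavy = t.migration_drives
print(f"drives for migration: light load {light}, heavy load {heavy}")
```

Under light load the throttle hands nearly all drives to the migration; under sustained heavy load it backs off to a minimum so user transfers keep flowing.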

The group also began exploring scheduler optimizations that use machine-learning analysis to predict job runtimes more accurately and schedule jobs more efficiently. A scheduler simulator demonstrated that this approach can extract more productive node-hours from the system. Work is under way to make the optimization efficient enough for a production environment. HSS also developed a utility that predicts up to 50% of observed node failures up to 30 days in advance, which could enable the pre-emptive removal of failing hardware from service and improve the overall rate of successful job completion.
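A runtime predictor of the kind such a scheduler optimization might use can be sketched as a per-(user, executable) history with a padded median estimate. The keying scheme, the 20% pad, and the sample numbers are illustrative assumptions:

```python
import statistics
from collections import defaultdict

# Hypothetical runtime predictor: estimate a job's runtime from the recent
# history of jobs with the same (user, executable) key, falling back to the
# requested wall time when no history exists.
class RuntimePredictor:
    def __init__(self, history=5):
        self.runs = defaultdict(list)
        self.history = history

    def record(self, user, app, runtime_s):
        runs = self.runs[(user, app)]
        runs.append(runtime_s)
        del runs[:-self.history]          # keep only the most recent runs

    def predict(self, user, app, requested_s):
        runs = self.runs.get((user, app))
        if not runs:
            return requested_s
        # Median of recent runs, padded 20% to reduce underestimates,
        # capped at the requested limit.
        return min(requested_s, 1.2 * statistics.median(runs))

p = RuntimePredictor()
for rt in [610, 580, 650, 600, 640]:
    p.record("alice", "wrf.exe", rt)

print(p.predict("alice", "wrf.exe", requested_s=43200))  # ~732 s, not 12 h
print(p.predict("bob", "cesm.exe", requested_s=3600))    # no history: 3600
```

Users typically request far more wall time than their jobs use; replacing the requested limit with a tighter prediction lets a backfill scheduler pack short jobs into gaps it would otherwise leave idle.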

Ensemble consistency tests and compression

To address the need for quality assurance in the continued development and improvement of the Community Earth System Model (CESM), CISL created the CESM Ensemble Consistency Test (CESM-ECT) in FY2018 as a statistical test for consistency between experimental outputs and an accepted ensemble. CESM-ECT provides rapid feedback to model developers, scientists, and end users, even those without expert knowledge of climate science. Ongoing development focuses on locating sources of statistical inconsistency and completing a framework that provides full-featured CESM quality assurance through consistency testing and error-source identification. CISL also made progress during the fiscal year on analyzing compression of climate data, addressing concerns in the climate community about compression-induced artifacts. Collaboration continues with developers of compression algorithms to incorporate CISL's metrics and statistical evaluation methods into the algorithm improvement cycle.
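In the spirit of an ensemble consistency test, the sketch below accepts a new run when its diagnostics fall within the spread of an accepted ensemble. The two global-mean variables, the 151-member ensemble, and the z-score threshold are simplifying assumptions; the real CESM-ECT operates on principal components of many output variables:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical accepted ensemble: each row is one run, columns are two
# global-mean diagnostics (e.g., surface temperature in K, pressure in Pa).
ensemble = rng.normal(loc=[288.0, 101325.0], scale=[0.05, 20.0], size=(151, 2))
mean = ensemble.mean(axis=0)
std = ensemble.std(axis=0, ddof=1)

def consistent(run, z_max=3.0):
    # A run is consistent if every diagnostic lies within z_max standard
    # deviations of the ensemble mean.
    z = np.abs((run - mean) / std)
    return bool(np.all(z < z_max))

good_run = mean + 0.5 * std                    # inside the ensemble spread
bad_run = mean + np.array([5 * std[0], 0.0])   # temperature far outside

print(consistent(good_run), consistent(bad_run))
```

The same pass/fail logic applies whether the "experiment" is a ported compiler, a new machine, or a lossy compression scheme, which is what links the consistency test to the compression work described above.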