Traditional tools for data analysis in the geosciences are effective for homogeneous data sets and for summarizing unambiguous features. They tend to break down, however, when used to combine sparse and heterogeneous observations or to quantify the uncertainty in estimated features or derived fields. Bayesian statistical models provide a framework for producing blended and gridded data products and for attaching uncertainty measures to the results. These tools from statistical science, however, must be adapted to handle the larger data volumes typical of geophysical problems.

This work fills a vital need to interpret geophysical data using contemporary statistical science together with machine learning methods from computer science. The benefits include better insight into hidden features of large data volumes and improved interpretation through added measures of uncertainty. It supports CISL's strategic goal to enhance the effective use of current and future computational systems. Specifically, this work strategically combines applied statistical science with supercomputing resources to sustain progress in the Earth System sciences.

The accomplishments described below are some of the projects that were advanced during this period, and they illustrate the diverse ways that data science is applied to NCAR's scientific problems.

Causal discovery is an area of machine learning that identifies potential cause-effect relationships, or so-called causal signatures, from data. Here it is applied to climate model simulations to identify interactions between different geophysical variables. During FY2016 this approach was refined for distinguishing among climate model runs, providing a different measure than the usual comparisons based on mean climate fields. These techniques can also quantify the impact of data compression on the causal signatures, helping to determine which type and amount of compression is acceptable.
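A core building block of constraint-based causal discovery is the conditional-independence test: two variables that are strongly correlated but become uncorrelated once a third variable is controlled for are unlikely to be directly linked. The sketch below illustrates this with a partial-correlation test on synthetic data; the variable names and the confounder setup are purely illustrative, and the actual methods applied to climate runs are more elaborate.

```python
import numpy as np

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for z,
    computed by correlating the residuals of regressions on z."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)            # common driver (e.g., a circulation index)
x = z + 0.1 * rng.normal(size=n)  # two variables responding to the same driver
y = z + 0.1 * rng.normal(size=n)

marginal = np.corrcoef(x, y)[0, 1]  # strong marginal correlation
direct = partial_corr(x, y, z)      # near zero: no direct x-y link
```

Here the marginal correlation between x and y is large, but the partial correlation given z is near zero, so a causal-discovery algorithm would remove the x-y edge and attribute the association to the common driver.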

Kriging is a well-known method in geostatistics for estimating how climate varies over a geographic region when the observational data are sparse or the computer model runs are limited. This research has produced spatial methods for large sample sizes using both new numerical algorithms that exploit sparse matrix methods (LatticeKrig) and approaches that use linear algebra libraries optimized for coprocessors such as GPUs. It also adapts multi-resolution representations to distributed-data constraints. This work produced 10x-100x computational speedups and makes spatial data analysis possible for geoscience problems that would have been intractable with traditional methods. During this period the LatticeKrig package was extended to include a spherical geometry for modeling spatial data of global extent.
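To make the underlying idea concrete, here is a minimal dense-matrix kriging sketch on a synthetic 1-D example: predictions at new locations are covariance-weighted combinations of the observations, and the kriging variance quantifies interpolation uncertainty. The exponential covariance and parameter values are illustrative assumptions; LatticeKrig's actual sparse multi-resolution machinery (implemented in R) is considerably more sophisticated and is what enables the large-sample speedups described above.

```python
import numpy as np

def exp_cov(d, sigma2=1.0, theta=0.5):
    """Exponential covariance as a function of distance (illustrative choice)."""
    return sigma2 * np.exp(-d / theta)

rng = np.random.default_rng(1)
s = np.sort(rng.uniform(0, 1, 15))   # sparse observation locations
y = np.sin(2 * np.pi * s)            # observations of a smooth field

# Simple kriging predictor: y_hat(s0) = c0 @ C^{-1} y
D = np.abs(s[:, None] - s[None, :])
C = exp_cov(D) + 1e-8 * np.eye(len(s))   # tiny nugget for numerical stability
s0 = np.linspace(0, 1, 101)              # prediction grid
c0 = exp_cov(np.abs(s0[:, None] - s[None, :]))
y_hat = c0 @ np.linalg.solve(C, y)

# Kriging variance: small near observations, larger in data gaps
var = exp_cov(0.0) - np.einsum('ij,ij->i', c0, np.linalg.solve(C, c0.T).T)
```

Note that the dense solve above costs O(n^3) in the number of observations, which is exactly the bottleneck that sparse-matrix and GPU-accelerated approaches are designed to remove.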

Applying data science techniques to the analysis of numerical models is useful when a complex geophysical model is expensive to run for many different inputs but can be approximated with a statistical emulator. During this period an emulator method succeeded in representing a wider range of climate model responses than the parent numerical model ensemble. Specifically, pattern scaling was used to extend surface temperature results of an NCAR/DOE CESM1 large initial-condition ensemble from a high-emissions scenario (RCP8.5) to a lower-emissions scenario (RCP4.5). Significantly, the emulation also reproduces the internal variability in the temperature fields along with uncertainty in the scaling pattern. The uncertainty in the scaling pattern itself is represented by a Gaussian random field whose covariance parameters can be estimated from the ensemble model output.
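The basic pattern-scaling step can be sketched as follows: regress local temperature on global mean temperature at each grid point in the training scenario, then apply the fitted pattern to the global-mean pathway of another scenario, adding noise to mimic internal variability. All data here are synthetic and the noise model is a placeholder; the actual emulator represents pattern uncertainty with a Gaussian random field fitted to the CESM1 ensemble, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(2)
n_years, n_grid = 80, 50
true_pattern = rng.uniform(0.5, 2.0, n_grid)  # local warming per degree of global mean

# Synthetic "high-emissions" training run: scaled local response plus noise
T_global_hi = np.linspace(0.0, 4.0, n_years)
T_local = np.outer(T_global_hi, true_pattern) + 0.2 * rng.normal(size=(n_years, n_grid))

# Estimate the scaling pattern by least squares at every grid point at once
X = np.column_stack([np.ones(n_years), T_global_hi])
coef, *_ = np.linalg.lstsq(X, T_local, rcond=None)
pattern_hat = coef[1]

# Emulate a lower-emissions scenario: apply the pattern to a new global-mean
# pathway and add noise standing in for internal variability
T_global_lo = np.linspace(0.0, 2.0, n_years)
emulated = np.outer(T_global_lo, pattern_hat) + 0.2 * rng.normal(size=(n_years, n_grid))
```

The point of the technique is that the expensive numerical model need only be run under one scenario; the emulator then produces plausible fields, with variability, for scenarios the model never simulated.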

Statistical compression of climate data

Numerical climate model simulations at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full data sets by smaller, compressed versions. During FY2016 a statistical compression and decompression algorithm was developed based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full data set given the summary statistics. This approach is distinct from more deterministic representations of the fields using conventional compression technology. The statistical model can be used to generate realizations representing the full data set, along with characterizations of the uncertainties in the generated data. Considerable attention is paid to accurately modeling the original data set, particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured. Moreover, the method allows for fast decompression using parallelization on HPC systems and conditional emulation on modest computers.
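The compress-then-emulate idea can be illustrated on a toy time series: store only a handful of summary statistics, then "decompress" by drawing realizations from the statistical model those statistics imply. The AR(1) model and statistics chosen below are a deliberately simple stand-in for the actual algorithm, which models spatially nonstationary global fields.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
phi_true, sigma = 0.7, 1.0

# "Full" data set: an AR(1) series standing in for a climate time series
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + sigma * rng.normal()

# Compression: keep only a few summary statistics instead of all n values
stats = {
    "mean": x.mean(),
    "var": x.var(),
    "lag1": np.corrcoef(x[:-1], x[1:])[0, 1],
}

# Decompression: generate a realization from the model implied by the statistics
phi = stats["lag1"]
innov_sd = np.sqrt(stats["var"] * (1 - phi**2))  # matches the stored variance
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + innov_sd * rng.normal()
y += stats["mean"]
```

The regenerated series is not the original data point-for-point; rather, it is a draw from a distribution that reproduces the stored statistics, and repeated draws characterize the uncertainty introduced by the compression, which is precisely what distinguishes this approach from deterministic compression.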

Funding

Research in the Geophysical Statistics Project is made possible through NSF Core funding and NSF DMS award 1406536.