Develop statistical methods to interpret data and improve models

Traditional tools for data analysis in the geosciences are effective for homogeneous data sets and summarizing unambiguous features. They tend to break, however, when applied to combine sparse and heterogeneous observations or to quantify the uncertainty in estimated features or in derived fields. Bayesian statistical models provide a framework to produce blended and gridded data products and also to attach uncertainty measures to the results. These tools from statistical science, however, must be adapted to handle the larger data volumes typical in geophysical problems.

This work fills a vital need to interpret geophysical data using contemporary statistical science and drawing on machine learning from computer science. The benefits include better insight into hidden features of large data volumes and improved interpretation through adding measures of uncertainty. It supports CISL’s strategic goal to enhance the effective use of current and future computational systems. Specifically, this work strategically combines applied statistical science with supercomputing resources to sustain progress in the Earth System sciences.

The accomplishments described below are some of the projects that advanced during this period, and they illustrate the diverse ways that data science is applied to NCAR’s scientific problems.

Climate informatics

Causal discovery is an area of machine learning that identifies potential cause-effect relationships from data and is used to learn so-called causal signatures from data. It is applied to climate model simulations that indicate interactions between different geophysical variables. During FY2016 this approach was refined for distinguishing among climate model runs and provided a different measure than the usual comparisons based on mean climate fields. These techniques can also quantify the impact of data compression on the causal signatures to determine which type and amount of compression is acceptable.

Statistical methods for large spatial data

Kriging is a well-known geostatistical method for estimating how climate varies over a geographic region when observational data are sparse or computer model runs are limited. This research has produced spatial methods for large sample sizes, using both new numerical algorithms that exploit sparse matrix methods (LatticeKrig) and approaches that use linear algebra libraries optimized for coprocessors such as GPUs. It also adapts the idea of a multi-resolution representation to the constraints of distributed data. This work has produced 10x-100x computational speedups and makes spatial data analysis possible for geoscience problems that would have been intractable using traditional methods. During this period the LatticeKrig package was extended to include a spherical geometry for modeling spatial data of global extent.
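
To make the basic kriging idea concrete, the following is a minimal sketch of simple kriging on a small synthetic data set. It is not the LatticeKrig algorithm and does not use sparse matrices or GPUs: the exponential covariance, the parameter values, the synthetic observations, and the function names `exp_cov` and `simple_krige` are all assumptions chosen for illustration. The dense linear solve shown here is the step whose cost grows rapidly with sample size, which is what motivates the large-data methods described above.

```python
import numpy as np

def exp_cov(a, b, sill=1.0, range_=0.3):
    """Exponential covariance between two sets of 2-D locations (assumed model)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-d / range_)

def simple_krige(obs_xy, obs_z, new_xy, sill=1.0, range_=0.3, nugget=1e-6):
    """Simple kriging predictor and prediction variance.

    The constant mean is approximated by the sample mean of the observations.
    """
    mu = obs_z.mean()
    K = exp_cov(obs_xy, obs_xy, sill, range_) + nugget * np.eye(len(obs_xy))
    k = exp_cov(new_xy, obs_xy, sill, range_)        # shape (n_new, n_obs)
    weights = np.linalg.solve(K, k.T)                # shape (n_obs, n_new)
    pred = mu + weights.T @ (obs_z - mu)
    var = sill - np.sum(k * weights.T, axis=1)       # kriging variance
    return pred, var

# Synthetic, sparse observations on the unit square (illustrative only).
rng = np.random.default_rng(0)
obs_xy = rng.uniform(0.0, 1.0, size=(50, 2))
obs_z = np.sin(4 * obs_xy[:, 0]) + np.cos(3 * obs_xy[:, 1])

# Predict on a coarse grid and, as a check, back at the observation sites.
gx, gy = np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5))
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])
grid_pred, grid_var = simple_krige(obs_xy, obs_z, grid_xy)
pred_at_obs, var_at_obs = simple_krige(obs_xy, obs_z, obs_xy)
```

With a very small nugget the predictor nearly interpolates the observations, and the prediction variance returns to the sill far from any data.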

Uncertainty quantification and statistical emulators

Applying data science techniques to the analysis of numerical models is useful where a complex geophysical model is expensive to run under many different inputs, but can be approximated with a statistical emulator. During this period an emulator method was successful in representing a wider range of climate model responses than the parent numerical model ensemble. Specifically, pattern scaling was used to extend surface temperature results of an NCAR/DOE CESM1 large initial-condition ensemble from a high-emissions scenario (RCP8.5) to a lower-emissions scenario (RCP4.5). Significantly, the emulation also reproduces the internal variability in the temperature fields along with uncertainty in the scaling pattern. The uncertainty in the scaling pattern itself is represented by a Gaussian random field where the covariance parameters can be estimated from the ensemble model output.

Pattern scaling
Estimated pattern scaling for surface temperature based on the CESM large ensemble. This panel of plots summarizes the results of estimating the local temperature response to a global temperature change and also quantifies the uncertainty in this estimate using a spatial statistics model (Matern Gaussian Process). This analysis is based on the CESM large ensemble (30 members) for RCP8.5 and at approximately 1 degree spatial resolution. The image plot in the upper left is the change in mean temperature (°C) in each model grid box due to a one degree change in global mean temperature. Note that the changes are larger over land and also increase at higher latitudes. The plot in the upper right is the expected standard deviation (°C) in this pattern and reflects the variability in the mean pattern due to internal variability of the climate system as represented by the ensemble experiment. This variability was found to have some spatial coherence, and the plots in the lower left and right are the result of estimating a local spatial covariance function based on the Matern covariance with smoothness of 1.0. The correlation ranges indicate higher correlation in the pattern over the ocean and less over the land. Moreover, the signal-to-noise variance shows that patterns over land also have a larger component that is uncorrelated over space. These distinctions are important in simulating the variability in the pattern and are valuable for integrated assessment models (IAM). Based on this analysis, a much simpler climate model run could determine the global mean temperature under different scenarios of human activity, then use simulated pattern scaling fields to infer local changes in the mean temperature. This technique is much less costly in computer resources than running CESM for other scenarios and large ensembles.
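
The core of pattern scaling, the local temperature change per degree of global mean change, can be sketched as a per-grid-box linear regression. The example below is a minimal illustration on synthetic data, not the analysis behind the figure: the grid size, noise level, and the function name `estimate_pattern` are assumptions, and the uncertainty shown is an ordinary least-squares standard error rather than the Matern Gaussian process model fitted to the ensemble.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_lat, n_lon = 80, 6, 12

# Synthetic stand-ins for model output (all values are illustrative).
global_mean = np.linspace(0.0, 4.0, n_years)              # global mean warming
true_pattern = 1.0 + rng.uniform(0.0, 1.0, (n_lat, n_lon))
noise = 0.3 * rng.standard_normal((n_years, n_lat, n_lon))
local = true_pattern[None] * global_mean[:, None, None] + noise

def estimate_pattern(local, global_mean):
    """Least-squares slope of local temperature on global mean, per grid box."""
    g = global_mean - global_mean.mean()
    anomalies = local - local.mean(axis=0)
    ssg = np.sum(g ** 2)
    slope = np.tensordot(g, anomalies, axes=(0, 0)) / ssg
    resid = anomalies - slope[None] * g[:, None, None]
    sigma2 = np.sum(resid ** 2, axis=0) / (len(g) - 2)     # residual variance
    std_err = np.sqrt(sigma2 / ssg)                        # slope standard error
    return slope, std_err

pattern, std_err = estimate_pattern(local, global_mean)

# Scale the pattern by a global mean change from a hypothetical lower scenario.
local_change_low = pattern * 2.0
```

The last line reflects the use described in the caption: once the pattern is estimated, a global mean temperature change from a simpler model can be mapped to local changes without rerunning the full climate model.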

Statistical compression of climate data

Numerical climate model simulations at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important to develop methods for representing the full data sets by smaller, compressed versions. During FY2016 a statistical compression and decompression algorithm was developed based on storing a set of summary statistics as well as a statistical model describing the conditional distribution of the full data set given the summary statistics. This approach is distinct from more deterministic representations of the fields using conventional compression technology. The statistical model can be used to generate realizations representing the full data set, along with characterizations of the uncertainties in the generated data. Considerable attention is paid to accurately modeling the original data set, particularly with regard to the inherent spatial nonstationarity in global fields, and to determining the statistics to be stored, so that the variation in the original data can be closely captured. Moreover, the method allows for fast decompression using parallelization on HPC systems and conditional emulation on modest computers.
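
The general idea of storing summary statistics and then generating realizations conditional on them can be illustrated with a deliberately simple toy. In this sketch the summary statistics are block means and block standard deviations, and decompression draws independent Gaussian noise within each block, recentered so that the stored block means are reproduced. These choices, along with the names `compress` and `decompress`, are assumptions for illustration only; the method described above uses a statistical model that accounts for spatial nonstationarity, which this toy does not attempt.

```python
import numpy as np

def compress(field, block=4):
    """Reduce a 2-D field to per-block means and standard deviations."""
    n_lat, n_lon = field.shape
    blocks = field.reshape(n_lat // block, block, n_lon // block, block)
    return blocks.mean(axis=(1, 3)), blocks.std(axis=(1, 3))

def decompress(means, sds, block=4, rng=None):
    """Generate one realization of the full field given the stored statistics."""
    rng = np.random.default_rng() if rng is None else rng
    ones = np.ones((block, block))
    up_mean = np.kron(means, ones)                   # repeat each block mean
    up_sd = np.kron(sds, ones)                       # repeat each block sd
    noise = rng.standard_normal(up_mean.shape)
    nb = noise.reshape(means.shape[0], block, means.shape[1], block)
    nb = nb - nb.mean(axis=(1, 3), keepdims=True)    # zero mean within blocks
    return up_mean + up_sd * nb.reshape(up_mean.shape)

# Synthetic field: a smooth gradient plus small-scale noise (illustrative).
rng = np.random.default_rng(2)
lat = np.linspace(-1.0, 1.0, 16)[:, None]
field = 15.0 - 10.0 * lat ** 2 + 0.5 * rng.standard_normal((16, 32))

means, sds = compress(field)             # 64 stored values for 512 originals
recon = decompress(means, sds, rng=rng)  # one of many possible realizations
```

Calling `decompress` repeatedly yields different realizations that share the stored statistics, and the spread among them gives a rough picture of what the compression has discarded.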

Compression effects
Effect of compression on log contrast variances in daily temperature fields. This figure illustrates the compression for daily surface temperature fields from one member of the CESM large ensemble experiment for (model) year 2081. For each day, values at adjacent grid points or adjacent time points are differenced, and the log variance of these differences across the year is reported. This is a stringent test of compression fidelity and measures how well local variability and spatial structure in the fields are preserved by the compression scheme. The first column shows average north-south log contrast variances, the middle column shows east-west log contrast variances, and the third column shows one-step temporal contrast variances. The first row is computed from the original data, and the remaining rows represent compressions of 5:1, 10:1, and 20:1, respectively. Based on this metric, the spatial and temporal properties of the compressed data are almost indistinguishable from the original data, especially at the 5:1 level.
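
The log contrast variance metric described in this caption can be sketched in a few lines. The example below is an illustration under stated assumptions: the input is an array of daily fields with shape (days, latitude, longitude), the data are synthetic white noise, and the "degraded" copy is produced by simple rounding as a stand-in for a lossy compression scheme. The function name `log_contrast_variances` is hypothetical.

```python
import numpy as np

def log_contrast_variances(fields):
    """Log variance over the year of adjacent-point differences.

    fields has shape (n_days, n_lat, n_lon). Returns north-south,
    east-west, and one-step temporal log contrast variances.
    """
    ns = np.diff(fields, axis=1)   # differences between adjacent latitudes
    ew = np.diff(fields, axis=2)   # differences between adjacent longitudes
    tm = np.diff(fields, axis=0)   # differences between consecutive days
    return (np.log(ns.var(axis=0)),
            np.log(ew.var(axis=0)),
            np.log(tm.var(axis=0)))

# Synthetic year of daily fields (illustrative only).
rng = np.random.default_rng(3)
original = rng.standard_normal((365, 8, 16))
degraded = np.round(original * 2) / 2      # crude stand-in for lossy compression

ns0, ew0, tm0 = log_contrast_variances(original)
ns1, ew1, tm1 = log_contrast_variances(degraded)

# Largest change in the north-south metric caused by the degradation.
max_ns_shift = np.max(np.abs(ns1 - ns0))
```

Comparing the three maps before and after compression, as in the rows of the figure, shows whether local variability has been smoothed away or inflated.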


Research in the Geophysical Statistics Project is made possible through NSF Core funding and NSF DMS award 1406536.