Develop statistical methods to interpret data and improve models

Traditional tools for data analysis in the geosciences are effective for homogeneous data sets and for summarizing unambiguous features. They tend to break down, however, when applied to combine sparse and heterogeneous observations or to quantify the uncertainty in estimated features or derived fields. This work fills a vital need to interpret geophysical data using contemporary statistical science while drawing on machine learning from computer science. The benefits include better insight into hidden features of large data volumes and improved interpretation through added measures of uncertainty. It supports CISL’s strategic goal to enhance the effective use of current and future computational systems. Specifically, this work combines applied statistical science with supercomputing resources to sustain progress in the Earth System sciences.

The following accomplishments are some of the projects advanced during FY2017, and they illustrate the diverse ways that data science is applied to scientific problems at NCAR.

Uncertainty quantification and statistical emulators. Applying data science techniques to the analysis of numerical models is useful when a complex geophysical model is too expensive to run under many different inputs but can be approximated with a statistical emulator. During FY2017 this project focused on representing the uncertainty in the pattern scaling method that is calculated from the CESM large ensemble (LENS) under RCP8.5. The uncertainty in the scaling pattern itself is represented by a Gaussian random field whose covariance parameters can be estimated from the ensemble model output. We use the LatticeKrig (LK) model to estimate local covariance parameters for the scaling pattern. Because of the efficiency of the sparse matrix methods connected with the LK approach, simulation of the non-stationary Gaussian process is possible and runs about two orders of magnitude faster than conventional algorithms. To our knowledge this is the first statistical emulation of a large, non-stationary geophysical field.
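The core operation behind such an emulator is drawing realizations of a Gaussian random field with a fitted covariance. The project's LatticeKrig code is in R and handles non-stationary covariances with sparse matrices; the sketch below is only a minimal, stationary Python illustration of the same idea, using a dense Cholesky factor and an assumed exponential covariance.

```python
import numpy as np

def exp_cov(pts, rng=0.5, sigma2=1.0):
    """Stationary exponential covariance: sigma2 * exp(-d / rng)."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / rng)

def simulate_field(pts, rng=0.5, sigma2=1.0, seed=0):
    """Draw one Gaussian random field realization via a Cholesky factor."""
    C = exp_cov(pts, rng, sigma2)
    # small jitter on the diagonal keeps the factorization numerically stable
    L = np.linalg.cholesky(C + 1e-8 * np.eye(len(pts)))
    z = np.random.default_rng(seed).standard_normal(len(pts))
    return L @ z

# a coarse 10 x 10 grid on the unit square stands in for the scaling-pattern grid
x = np.linspace(0.0, 1.0, 10)
pts = np.array([(a, b) for a in x for b in x])
field = simulate_field(pts)
```

For the LENS-scale fields in this project, the dense Cholesky step above is exactly what becomes infeasible; the LK model's sparse precision matrices replace it, which is the source of the reported speedup.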

Statistical methods for large spatial data. Kriging is a well-known method used in geostatistics to estimate how climate varies over a geographic region when the observational data is sparse or the computer model runs are limited. A key feature of Kriging is determining the covariance function, a measure of how observations are correlated as a function of their separation distance.
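Given a covariance function, the kriging predictor at a new location is a weighted combination of the observations, with weights obtained by solving a linear system in the observation covariance matrix. The following is a minimal simple-kriging sketch (zero mean, assumed exponential covariance) for illustration; it is not the project's code.

```python
import numpy as np

def exp_cov(d, rng=0.3):
    """Exponential covariance as a function of separation distance d."""
    return np.exp(-d / rng)

def simple_krige(obs_pts, obs_vals, new_pts, rng=0.3, nugget=1e-6):
    """Zero-mean simple kriging: solve C_obs w = c_new, predict w' y."""
    d_oo = np.linalg.norm(obs_pts[:, None] - obs_pts[None, :], axis=-1)
    d_on = np.linalg.norm(obs_pts[:, None] - new_pts[None, :], axis=-1)
    C = exp_cov(d_oo, rng) + nugget * np.eye(len(obs_pts))
    w = np.linalg.solve(C, exp_cov(d_on, rng))
    return w.T @ obs_vals

g = np.random.default_rng(1)
obs_pts = g.random((30, 2))                      # 30 sparse observation sites
obs_vals = np.sin(3 * obs_pts[:, 0]) + 0.1 * g.standard_normal(30)
pred = simple_krige(obs_pts, obs_vals, np.array([[0.5, 0.5]]))
```

The linear solve against the n x n covariance matrix is the computational bottleneck for large n, which motivates both the sparse LK representation and the ABC approach described below.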

ABC method
Depicted are the approximate posteriors derived from an ABC method applied to estimate the covariance parameters in a simulated spatial data set. Here lambda is a ratio of spatial noise to signal variance and aw is a parameter that determines the correlation range. Estimates are based on approximately 1,300 irregular observations in a square domain. The first row shows the ABC results from a uniform prior (gray line) and the posterior (black line) based on a naïve equal weighting of the variogram statistic. Compared to the true values (red dashed line) these results are poor and suggest the method is biased. The second row shows the results for these parameters when the variogram statistic is weighted by its variance in different bins. This weighting is difficult to derive analytically but is possible to find because a simulation-based method is used to determine the variogram statistic. Here we see the ABC posterior provides a reliable estimate of the unknown parameters. Although not shown in this figure, this method also indicates that the spread of the approximate posterior is a reliable characterization of the uncertainty. This method is important because ABC will allow for the analysis of spatial data several orders of magnitude larger than is feasible with standard Bayesian algorithms for finding a posterior.

This project uses new ideas in approximate Bayesian computation (the ABC method) to estimate a covariance function for large data sets. We have successfully used statistics based on the variogram as the core of a Markov chain Monte Carlo algorithm to find covariance parameter estimates along with measures of uncertainty. The key is that the variogram statistic is much faster to compute than the determinant and inverse of a covariance matrix. In addition, the LatticeKrig (LK) model for spatial data has been formulated so that large sample sizes are fast to simulate, which allows a simulation-based description of the variogram distribution.
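To make the logic concrete, here is a toy rejection-ABC sketch in Python: a binned empirical variogram serves as the summary statistic, fields are simulated from candidate parameters, and the prior draws whose variograms best match the observed one form the approximate posterior. The project's actual method uses MCMC-based ABC with LK simulation and variance weighting of the variogram bins; everything below (exponential covariance, uniform prior, acceptance rule) is an illustrative assumption.

```python
import numpy as np

def binned_variogram(pts, vals, bins):
    """Empirical semivariogram: mean squared half-differences by distance bin."""
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    iu = np.triu_indices(len(pts), 1)
    d, sq = d[iu], 0.5 * (vals[:, None] - vals[None, :])[iu] ** 2
    idx = np.digitize(d, bins)
    return np.array([sq[idx == k].mean() if np.any(idx == k) else 0.0
                     for k in range(1, len(bins))])

def simulate(pts, range_par, seed):
    """Draw a Gaussian field with an exponential covariance of range range_par."""
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    L = np.linalg.cholesky(np.exp(-d / range_par) + 1e-8 * np.eye(len(pts)))
    return L @ np.random.default_rng(seed).standard_normal(len(pts))

g = np.random.default_rng(0)
pts = g.random((60, 2))
bins = np.linspace(0.0, 0.8, 6)
obs = simulate(pts, 0.3, seed=123)            # synthetic "data", true range 0.3
s_obs = binned_variogram(pts, obs, bins)

# rejection ABC: keep prior draws whose variogram is closest to the observed one
prior = g.uniform(0.05, 1.0, 200)
dist = [np.linalg.norm(binned_variogram(pts, simulate(pts, r, seed=i), bins) - s_obs)
        for i, r in enumerate(prior)]
posterior = prior[np.argsort(dist)[:20]]      # accept the closest 20 draws
```

The point of the design is visible even in this toy: no covariance determinant or inverse is ever formed for the data, only variograms of simulated fields, so the cost per candidate parameter scales far better with the number of observations.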

Timing study
Timing study for spatial analysis on Cheyenne. Plotted are times for different components of a spatial analysis that involves fitting a spatial model to small windows around 1,000 grid boxes. The blue points are the times (in seconds) to complete the computations, orange points are the times to create the worker R sessions, and green points are the times to broadcast the data set to the workers. These results indicate nearly linear scaling as the number of cores (also workers in this case) increases. Practically, these results indicate that 1,000 cores on Cheyenne provide nearly a 1,000-fold speedup in completing these spatial statistics computations. The time to create the workers also rises linearly, but at 1,000 cores it is still only about 10% of the total time. Surprisingly, broadcasting the 12-MB data set takes only a small fraction of the total time.

HPC4Stats framework. A common task in analyzing model output is fitting statistical models or performing other statistical operations on each grid cell or in moving windows. These tasks are typically embarrassingly parallel and provide an opportunity to accelerate the data analysis using a simple framework. CISL is harnessing the rich set of statistical functions and packages in the R data analysis environment and builds the computational tasks within that environment.

The framework is designed to require minimal knowledge of HPC systems or UNIX and focuses on the user’s skill in R. Besides some simple batch scripting and changing directory pathnames, the same HPC4Stats application will run on a laptop or a supercomputer. The framework uses a supervisor/worker model in which the supervisor and the workers are independent R sessions. Assigning tasks to workers, performing the computation, and assembling the results are all done in R, which minimizes the adjustment when moving from a PC or cluster to an HPC environment.

The HPC4Stats framework has shown nearly linear scaling on Cheyenne up to 1,000 cores and so promises speedup of several orders of magnitude even for complex statistical computations.

Funding. This research is made possible by NSF Core funding. Workshops receive partial additional support from registration, sponsorship, and donations.