Traditional tools for data analysis in the geosciences are effective for homogeneous data sets and for summarizing unambiguous features. They tend to break down, however, when they are applied to combine sparse and heterogeneous observations or to quantify the uncertainty in estimated features or derived fields. This work fills a vital need to interpret geophysical data using contemporary statistical science and machine learning methods from computer science. The benefits include better insight into hidden features of large data volumes and improved interpretation through added measures of uncertainty. It supports CISL’s strategic goal to enhance the effective use of current and future computational systems. Specifically, this work combines applied statistical science with supercomputing resources to sustain progress in the Earth System sciences.
The following accomplishments are some of the projects that have been advanced during FY2017, and they illustrate the diverse ways that data science is applied to scientific problems at NCAR.
Uncertainty quantification and statistical emulators. Applying data science techniques to the analysis of numerical models is useful when a complex geophysical model is too expensive to run under many different inputs but can be approximated with a statistical emulator. During FY2017 this project focused on representing the uncertainty in the pattern scaling method as calculated from the CESM Large Ensemble (LENS) under RCP8.5. The uncertainty in the scaling pattern itself is represented by a Gaussian random field whose covariance parameters can be estimated from the ensemble model output. We use the LatticeKrig (LK) model to estimate local covariance parameters for the scaling pattern. Because of the efficiency of the sparse matrix methods connected with the LK approach, simulation of the non-stationary Gaussian process is feasible and runs roughly two orders of magnitude faster than conventional algorithms. To our knowledge this is the first statistical emulation of a large and non-stationary geophysical field.
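As a concrete illustration of this approach, the sketch below simulates realizations of a Gaussian random field with the LatticeKrig R package on a small grid. It is a simplified, stationary example with illustrative grid dimensions and covariance settings; the actual emulator uses covariance parameters that vary over space and are estimated from the LENS output.

    # Minimal sketch: simulating a Gaussian random field with the LatticeKrig (LK)
    # model on a small lon/lat grid. Grid size and covariance settings are
    # illustrative only, not the values used in the LENS pattern-scaling analysis.
    library(LatticeKrig)   # also loads the fields and spam packages

    # A coarse longitude/latitude grid of prediction locations
    gridList <- list(lon = seq(-180, 180, length.out = 40),
                     lat = seq(-60, 60, length.out = 30))
    locs <- make.surface.grid(gridList)

    # Multiresolution LK basis; NC, nu, and a.wght control resolution and smoothness
    LKinfo <- LKrigSetup(locs, nlevel = 2, NC = 10, nu = 1, a.wght = 4.5)

    # Five independent realizations of the process at the grid locations;
    # sparse-matrix methods inside LKrig.sim keep this fast even for large grids
    set.seed(123)
    sims <- LKrig.sim(locs, LKinfo = LKinfo, M = 5)

    # Each column of 'sims' is one simulated field; display the first realization
    image.plot(as.surface(locs, sims[, 1]))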
Statistical methods for large spatial data. Kriging is a well-known geostatistical method for estimating how climate varies over a geographic region when the observational data are sparse or the computer model runs are limited. A key step in kriging is estimating the covariance function, a measure of how observations are correlated as a function of their separation distance.
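For reference, the covariance function and the closely related variogram, which is the summary statistic used in the ABC approach described next, can be written as follows. The exponential form is shown only as a common illustrative example, not necessarily the covariance family used in this project.

    % Z(s): spatial field at location s; h: separation distance (stationarity assumed)
    C(h) = \operatorname{Cov}\!\bigl( Z(s),\, Z(s+h) \bigr),
    \qquad \text{e.g.}\quad C(h) = \sigma^2 \exp(-h/\rho),
    % the (semi)variogram and its relation to the covariance function
    \gamma(h) = \tfrac{1}{2}\,\operatorname{Var}\!\bigl( Z(s+h) - Z(s) \bigr) = C(0) - C(h),

where \sigma^2 and \rho are the variance and range parameters.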
This project uses new ideas in approximate Bayesian computation (the ABC method) to estimate a covariance function for large data sets. We have successfully used statistics based on the variogram as the core of a Markov chain Monte Carlo algorithm to find covariance parameter estimates along with measures of their uncertainty. The key is that the variogram statistic is much faster to compute than the determinant and inverse of a covariance matrix. In addition, the LK model for spatial data has been formulated so that it is fast to simulate even for large sample sizes, and this allows for a simulation-based description of the variogram’s distribution.
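The sketch below conveys the basic idea under simplifying assumptions: a one-parameter exponential covariance stands in for the LK model, plain ABC rejection sampling stands in for the full Markov chain Monte Carlo algorithm, and the sample size, prior range, and acceptance rule (keeping the 50 closest draws) are arbitrary choices for illustration.

    # Sketch of ABC with a binned variogram as the summary statistic.
    # Simplifications relative to the project: a one-parameter exponential
    # covariance instead of the LK model, and rejection sampling instead of MCMC.
    library(fields)

    set.seed(1)
    n    <- 200
    locs <- cbind(runif(n), runif(n))     # random 2-D locations in [0,1]^2
    D    <- rdist(locs)                   # pairwise distance matrix

    # Simulate a field from an exponential covariance with a given range
    simField <- function(range) {
      Sigma <- exp(-D / range) + diag(1e-8, n)
      drop(t(chol(Sigma)) %*% rnorm(n))
    }
    yObs <- simField(0.2)                 # "observed" data, true range 0.2

    # Binned empirical variogram (fields::vgram) as the summary statistic
    binStat <- function(z) vgram(locs, z, N = 15, dmax = 0.7)$stats["mean", ]
    obsStat <- binStat(yObs)

    # ABC rejection: draw candidate ranges from a uniform prior, simulate a field
    # for each, and keep those whose variogram is closest to the observed one
    cand  <- runif(500, 0.05, 0.5)
    score <- sapply(cand, function(r) sum((binStat(simField(r)) - obsStat)^2))
    keep  <- cand[rank(score) <= 50]

    # Approximate posterior summary for the range parameter, with uncertainty
    c(estimate = median(keep),
      lower    = unname(quantile(keep, 0.05)),
      upper    = unname(quantile(keep, 0.95)))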
HPC4Stats framework. A common task in analyzing model output is fitting statistical models or performing other statistical operations on each grid cell or in moving windows. These tasks are typically embarrassingly parallel and so provide an opportunity to accelerate the data analysis using a simple framework. CISL is harnessing the rich set of statistical functions and packages in the R data analysis environment and building the framework’s computational tasks in R.
The framework is designed to require minimal knowledge of HPC systems or UNIX and focuses instead on the user’s skill in R. Apart from some simple batch scripting and changes to directory pathnames, the same HPC4Stats application will run on a laptop or a supercomputer. The framework uses a supervisor/worker model in which the supervisor and workers are independent R sessions. Assigning tasks to workers, performing the computation, and assembling the results are all done in R, which minimizes the adjustment in moving from a PC or small cluster to an HPC environment.
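The sketch below conveys the flavor of this supervisor/worker pattern using base R’s parallel package. It is not the HPC4Stats code itself; the grid dimensions, the synthetic data, and the per-cell task (a simple linear trend fit) are all illustrative.

    # Minimal sketch of the supervisor/worker pattern behind grid-cell analyses.
    # This is NOT the HPC4Stats code; it uses base R's 'parallel' package, and the
    # per-cell task (a linear trend fit) is illustrative only.
    library(parallel)

    # Synthetic model output: 30 years of annual values on a 20 x 20 grid
    nYear <- 30; nLon <- 20; nLat <- 20
    tas   <- array(rnorm(nYear * nLon * nLat), dim = c(nLon, nLat, nYear))
    years <- 1:nYear

    # The task applied independently to every grid cell: fit a linear trend
    # and return the slope and its standard error
    cellTask <- function(series, years) {
      fit <- summary(lm(series ~ years))$coefficients
      c(slope = fit["years", "Estimate"], se = fit["years", "Std. Error"])
    }

    # Supervisor: split the grid into a list of cell time series (the task list)
    taskList <- lapply(seq_len(nLon * nLat), function(k) {
      ij <- arrayInd(k, c(nLon, nLat))
      tas[ij[1], ij[2], ]
    })

    # Workers are independent R sessions; the worker count is the main setting
    # that changes between a laptop and an HPC node in this simple sketch
    cl <- makeCluster(4)
    results <- parLapply(cl, taskList, cellTask, years = years)
    stopCluster(cl)

    # Supervisor assembles the results back onto the grid
    slopes <- matrix(sapply(results, `[`, "slope"), nLon, nLat)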
The HPC4Stats framework has shown nearly linear scaling on Cheyenne up to 1,000 cores and so promises speedups of several orders of magnitude, even for complex statistical computations.
Funding. This research is made possible by NSF Core funding. Workshops receive partial additional support from registration, sponsorship, and donations.