Statistics for geophysical data and model experiments

Spatial statistics
A distributed implementation of multi-resolution approximation for very large spatial data. Conducting spatial statistics for very large data sets, such as satellite data, has been notoriously challenging because it requires working with matrices whose dimensions equal the number of observations. We make use of HPC infrastructure and develop statistical models that allow for efficient parallelization to overcome this computational bottleneck. One such method is the multi-resolution approximation, which is based on a hierarchical system of basis functions of increasing spatial resolution. The idea is that finer levels are localized in their use of data and rely only on the parent node for global information. The figure shows an example of a five-level hierarchical structure. The areas highlighted in green show where data located within the grid box on the finest-resolution level (the level at the bottom) impact coarser levels. The red crosses indicate the knot locations of the basis functions. At the finest resolution, the knot locations coincide with the data locations, which leads to the attractive feature that the statistical covariance for data within the same grid box is represented exactly and not approximated. This feature is in contrast to other statistical approximations that do not represent the data at their original level. This model specification allows for efficient parallelization by solving each grid box individually while still accounting for dependency at coarser resolutions. We can currently solve systems for over 30 million observations using Yellowstone’s Geyser cluster and are working toward solving applications with over 125 million data points, such as high-resolution satellite observations of sea surface temperatures.
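
The hierarchy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the distributed implementation run on Yellowstone: the domain is taken to be the unit square, each level is assumed to halve the box width in both directions (the text does not state the splitting rule), and the function names box_index, ancestor_chain, and group_by_finest_box are hypothetical. The sketch assigns observations to grid boxes on the finest of five levels and lists the coarser boxes that a given finest-level box feeds into, which corresponds to the green-highlighted areas in the figure.

```python
import random


def box_index(x, y, level):
    """Grid-box indices (ix, iy) of a point in the unit square at a given level.

    Assumption: level 0 is the whole domain and every further level halves
    the box width in both x and y.
    """
    n = 2 ** level
    # Points on the upper boundary are assigned to the last box.
    return min(int(x * n), n - 1), min(int(y * n), n - 1)


def ancestor_chain(ix, iy, level):
    """Boxes (level, ix, iy) that contain the given box, from finest to coarsest."""
    chain = []
    for coarser in range(level, -1, -1):
        shift = level - coarser  # number of halvings between the two levels
        chain.append((coarser, ix >> shift, iy >> shift))
    return chain


def group_by_finest_box(points, n_levels):
    """Map each finest-level box to the row numbers of the points it contains."""
    finest = n_levels - 1
    groups = {}
    for row, (x, y) in enumerate(points):
        groups.setdefault(box_index(x, y, finest), []).append(row)
    return groups


random.seed(0)
points = [(random.random(), random.random()) for _ in range(1000)]  # synthetic locations
n_levels = 5

groups = group_by_finest_box(points, n_levels)
chain = ancestor_chain(5, 10, n_levels - 1)
for level, ix, iy in chain:
    print(f"level {level}: box ({ix}, {iy})")
```

Because each finest-level box contributes to only one box per coarser level, the per-box work can be handed to separate processors, with information shared only through the ancestors.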

From its unique position within CISL and IMAGe, the Geophysical Statistics Project (GSP) has been a leader in research and training emphasizing the synergy between the geosciences and data science. The mission of GSP is to pursue basic methodological and theoretical statistical research for scientific problems arising in the geosciences and at NCAR. Given GSP’s position within CISL, it is natural to focus on developing algorithms and data science tools that harness multi-core and high-performance computing environments to analyze large datasets, in particular those involving spatial and spatial-temporal dependence. Further, GSP has a strong interdisciplinary training component supporting graduate students and postdoctoral visiting scientists. These young researchers are immersed in research activities that not only sharpen their skills as applied statisticians but also expose them to important geophysical applications and innovative computational resources.

GSP also has an active visitor program providing research opportunities for visiting faculty members from across the nation and abroad. One valuable asset is GSP’s membership as a node in the NSF-DMS-funded research network named Statistics for the Atmosphere and Ocean Sciences (STATMOS). STATMOS not only helps to fund visits of data scientists to NCAR but also provides a link to key university programs that train students in statistics for geophysical and environmental problems. Visitor programs, as well as the research and training aspects of GSP that emphasize the interaction between statistics and the geosciences, capture the goals of integration, innovation, and community building within the CISL Strategic Plan.

This program advances CISL’s strategic imperative to produce scientific excellence by leading the scientific community in adopting new computational methods and mathematical tools that enhance scientific research. More specifically, GSP supports CISL’s science frontier of developing innovative statistical design and analysis techniques to improve the efficiency and accuracy of model development and testing.

During FY2015, GSP researchers were involved in numerous projects with NCAR staff and university collaborators. Some highlights include:

  • Developing spatial models for large data sets that lend themselves to parallelization. Spatial data sets derived from remotely sensed measurements are typically too large for standard statistical methods, and this research seeks alternative methods to handle these important types of observations. One strategy is to divide up the region of interest recursively and then define specific, multiresolution basis functions that are limited to each of these divisions. (See figure above.) The field of interest is expressed as a sum of these basis functions multiplied by coefficients. A statistical model for the basis coefficients, similar to a spatial version of the Kalman filter, results in a flexible method for representing a spatial process. Moreover, the multiresolution property of the model means that the computations can be distributed efficiently across many processors. This approach has produced breakthroughs in the size of data sets that can be tackled, and the parallel implementation on Yellowstone is a new use of this facility.

  • Building software tools to analyze large spatial datasets. This includes LatticeKrig, a contributed package in the R statistical language, that can accelerate spatial analysis by factors of 10 or more using sparse matrix methods. This software was extended to periodic regions and also to inverse problems where the observation is a linear operator applied to the field of interest. This feature is helpful in inverting solar corona measurements to infer 3D features in the solar atmosphere and for combining fields of different resolutions. More broadly, GSP continues to use the LatticeKrig model as a substrate to develop theory and methodology for analyzing spatial and spatial-temporal data (including large datasets, non-stationary covariance functions, and multivariate spatial observations).

  • Combining heterogeneous data products. A common data problem in the geosciences is blending several different data products or observations of a physical quantity to produce a single estimate. Even when methods are available to combine fields, they often do not provide measures of uncertainty that reflect disagreement and errors among the individual data products. This research applied a Bayesian hierarchical model to show how data products for snow water equivalent (SWE) measurements could be blended into a single coherent field. The main idea is to create a statistical model that relates each data product, which may differ in resolution and quality, to a single hidden field that represents the geophysical variable. This data layer is combined with a spatial model for the actual SWE field and with priors that give initial information on the statistical parameters. This framework then allows one, via Bayes’ Theorem, to find the distribution of SWE fields that are likely given the collection of data products. One novel feature of this work is its ability to handle large gridded data products, which was accomplished using the LatticeKrig model for spatial fields.

    Statistical transformation
    This figure illustrates the need for statistical methods to combine geophysical data products that describe the same field. Shown are four different versions of snow water equivalent (SWE) for part of the U.S. Rocky Mountain region and surrounding states (state outlines in gray). For analysis, the SWE values have been transformed with a power transformation, and the mean fields for 18 years are shown in these image plots. Despite some common spatial patterns, the data products differ in resolution, smoothness, and their extremes. Besides the need to find a common field to summarize these four products, this example also shows the need to express the uncertainty suggested by the variation among the fields. The difficulty of a statistical solution is compounded by the large number of grid points representing North America and by the fact that SWE follows a skewed, non-Gaussian distribution.
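
As a rough illustration of the first two highlights (a field written as a sum of localized basis functions times coefficients, and sparse matrix methods that speed up the fit), the following Python sketch fits a one-dimensional curve. It is not LatticeKrig and does not use its API. The triangular basis functions, the tridiagonal prior precision, the parameters lam and kappa2, the synthetic data, and all function names are assumptions chosen to keep the example short. Because each basis function overlaps only its neighbors, the linear system for the coefficients is tridiagonal and can be solved in time proportional to the number of knots.

```python
import math
import random


def hat(x, knot, width):
    """Triangular basis function centred at `knot`; zero farther than `width` away."""
    return max(0.0, 1.0 - abs(x - knot) / width)


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. lower[i] = A[i][i-1], upper[i] = A[i][i+1]; all length n."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def fit_coefficients(xs, ys, knots, lam=0.1, kappa2=0.1):
    """Solve (Phi'Phi + lam * Q) c = Phi'y, with Q = tridiag(-1, 2 + kappa2, -1).

    Assumes equally spaced knots and every x inside [knots[0], knots[-1]].
    Only the three diagonals of the system are stored, never a dense matrix.
    """
    n = len(knots)
    width = knots[1] - knots[0]
    diag = [lam * (2.0 + kappa2)] * n
    off = [-lam] * n  # off[j] couples knots j and j + 1; the last entry is unused
    rhs = [0.0] * n
    for x, y in zip(xs, ys):
        j = min(int((x - knots[0]) / width), n - 2)  # x lies between knots j and j + 1
        pj, pk = hat(x, knots[j], width), hat(x, knots[j + 1], width)
        diag[j] += pj * pj
        diag[j + 1] += pk * pk
        off[j] += pj * pk
        rhs[j] += pj * y
        rhs[j + 1] += pk * y
    return solve_tridiagonal([0.0] + off[:-1], diag, off[:-1] + [0.0], rhs)


def evaluate(x, knots, coef):
    """Field value at x: the sum of basis functions multiplied by coefficients."""
    width = knots[1] - knots[0]
    return sum(c * hat(x, k, width) for k, c in zip(knots, coef))


random.seed(0)
knots = [i / 20 for i in range(21)]
xs = [random.random() for _ in range(500)]  # synthetic observation locations
ys = [math.sin(2 * math.pi * x) + random.gauss(0.0, 0.1) for x in xs]

coef = fit_coefficients(xs, ys, knots)
fitted = evaluate(0.25, knots, coef)
print(f"fitted value at x = 0.25: {fitted:.3f}")
```

The same idea carries over to two dimensions, where the system is no longer tridiagonal but remains sparse, so a sparse solver takes the place of the Thomas algorithm.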
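
The blending idea in the last highlight can be shown at its simplest for a single grid cell. The Python sketch below is a heavily simplified stand-in for the Bayesian hierarchical model, not the model used in this research: it ignores spatial dependence, treats each data product as the hidden value plus independent Gaussian error on the transformed scale, and uses a conjugate Gaussian prior. The SWE values, the error variances, the prior, the power of 0.5, and the function names are all made up for illustration.

```python
def power_transform(value, power=0.5):
    """Power transform for skewed, non-negative values (the power is an assumed value)."""
    return value ** power


def blend_gaussian(observations, error_vars, prior_mean, prior_var):
    """Posterior mean and variance of one hidden value under a conjugate Gaussian model.

    Each observation is modelled as hidden value + independent Gaussian error.
    Bayes' theorem then gives a precision-weighted average of prior and data.
    """
    precision = 1.0 / prior_var
    weighted_sum = prior_mean / prior_var
    for y, v in zip(observations, error_vars):
        precision += 1.0 / v
        weighted_sum += y / v
    return weighted_sum / precision, 1.0 / precision


# Hypothetical SWE values (mm) from four data products at the same grid cell.
swe_products = [120.0, 95.0, 150.0, 110.0]
# Assumed error variances on the transformed scale; smaller means more trusted.
error_vars = [1.0, 2.0, 4.0, 1.5]

transformed = [power_transform(v) for v in swe_products]
post_mean, post_var = blend_gaussian(transformed, error_vars, prior_mean=10.0, prior_var=25.0)

print(f"posterior mean on transformed scale: {post_mean:.2f}")
print(f"posterior variance on transformed scale: {post_var:.3f}")
print(f"back-transformed central value (mm): {post_mean ** 2:.1f}")
```

The posterior variance is what a simple average of the products would not provide: it shrinks as more products are added and grows when the products are assigned larger errors. In the full model, the hidden value at each cell is replaced by a spatial field, so neighboring cells also inform the estimate.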

The GSP project is made possible through NSF Core funding, as well as grants through NSF’s Division of Mathematical Sciences (DMS1417), and a subaward from the University of Colorado at Boulder, 1551584.