Advance data-centric research

Grand challenges of modeling the Earth System require the interpretation and transformation of geophysical data in many forms. These activities range from mining the Big Data problems associated with large numerical experiments to interpreting the wide range of small but vital historical data sets that document past climate and important geophysical processes. Also of note are the massive data archives from remotely sensed and in situ instruments. Accordingly, CISL takes a broad view of data, including traditional observations from instruments, outputs from models, and derived “data products” from analyses. In the past, HPC in the geosciences has focused on modeling. CISL research also examines how data transformation and data analysis benefit from HPC resources.

Statistical emulation
Comparison between a statistical emulation of the scaling patterns and the actual ensemble members. These results are in the context of capturing the variation between different ensemble members produced by the NCAR LENS. The fields shown here are differences between individual model runs and statistical emulations of the mean scaling pattern. Here a non-stationary, spatial statistical model is fit to describe the variation among ensemble members. The top row of this figure shows four fields simulated from this statistical model. For reference, the bottom row shows the first four ensemble members (out of 30 total). The goal is for the statistical realizations to capture the heterogeneity and spatial coherence of the actual model results. The statistical emulation will not match the model results exactly but rather should appear to come from the same distribution. Here it is difficult to tell the two rows apart, so the emulation is successful. Ideally, a climate modeler should not be able to tell which row is the actual output and which row is the statistical copy. The top row reproduces the difference in variation over ocean and land and also generates properly sized coherent spatial correlations. One detail that is not reproduced is the longer horizontal correlation over the ocean, which tends to produce longer and narrower structures in the model fields. These results are significant because the statistical emulation can be computed much faster than additional runs of the model, so the ensemble size can be easily increased.
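As a rough illustration of what "simulating fields from a spatial statistical model" involves, the sketch below draws four Gaussian random fields on a small grid. It is a deliberate simplification and not the model described above: it assumes a stationary exponential covariance with made-up variance and length-scale values, whereas the emulation in the figure uses a non-stationary model fit to the ensemble. The function names and grid size are hypothetical.

```python
import numpy as np

def exponential_covariance(coords, variance=1.0, length_scale=3.0):
    """Stationary exponential covariance between all pairs of grid points.

    The variance and length_scale values are illustrative assumptions.
    """
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return variance * np.exp(-dists / length_scale)

def simulate_fields(cov, n_fields, rng):
    """Draw zero-mean Gaussian fields with covariance `cov` via Cholesky."""
    # A tiny diagonal jitter guards against round-off in the factorization.
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(cov.shape[0]))
    white = rng.standard_normal((cov.shape[0], n_fields))
    return (chol @ white).T  # shape: (n_fields, n_gridpoints)

# Hypothetical 12 x 16 grid standing in for a (much larger) model grid.
ny, nx = 12, 16
yy, xx = np.meshgrid(np.arange(ny), np.arange(nx), indexing="ij")
coords = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)

cov = exponential_covariance(coords)
rng = np.random.default_rng(0)
fields = simulate_fields(cov, n_fields=4, rng=rng).reshape(4, ny, nx)
```

Each simulated field is cheap to generate once the covariance is factored, which is the sense in which an emulator can enlarge an ensemble far faster than additional model runs. A dense Cholesky factorization like this one does not scale to full climate grids; that limitation is what motivates the approximate methods discussed in this section.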

CISL’s data-centric view with a focus on high performance computing results in research that integrates different aspects of computational and mathematical science. For example, our research on large data assimilation problems combines algorithms for ensemble representation (e.g., ensemble Kalman filter) with statistical ideas for robustness and stability of the methods. Making regional climate experiments useful for impacts analysis has resulted in combining ideas from fitting statistical distributions with the specific needs for objective bias corrections of model output. The need for spatial statistics for large data sets has spurred approximations to standard Bayesian statistics that are suited to parallel computing. Finally, the research on data compression has involved blending “off-the-shelf” compression algorithms with the particular requirements and workflows that are encountered in climate model research. All of these areas require the additional research of developing algorithms and workflows that scale to the large parallel computational architectures available to the geosciences.
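For readers unfamiliar with the ensemble Kalman filter mentioned above, the following is a minimal sketch of one analysis (update) step in its textbook stochastic, perturbed-observation form. It assumes a linear observation operator and a tiny synthetic state; it is not taken from DART or any other production system, and the function name, argument names, and numbers are hypothetical.

```python
import numpy as np

def enkf_analysis(ensemble, obs, obs_operator, obs_error_cov, rng):
    """One stochastic ensemble Kalman filter update.

    ensemble:      (n_state, n_members) prior ensemble
    obs:           (n_obs,) observed values
    obs_operator:  (n_obs, n_state) linear map from state to observations
    obs_error_cov: (n_obs, n_obs) observation error covariance
    """
    n_members = ensemble.shape[1]
    anomalies = ensemble - ensemble.mean(axis=1, keepdims=True)
    predicted = obs_operator @ ensemble
    predicted_anom = obs_operator @ anomalies

    # Sample covariances estimated from the ensemble.
    cov_xy = anomalies @ predicted_anom.T / (n_members - 1)
    cov_yy = predicted_anom @ predicted_anom.T / (n_members - 1) + obs_error_cov

    # Kalman gain K = cov_xy @ inv(cov_yy), computed with a linear solve.
    gain = np.linalg.solve(cov_yy, cov_xy.T).T

    # Each member assimilates its own noisy copy of the observations.
    noise = rng.multivariate_normal(np.zeros(len(obs)), obs_error_cov, size=n_members).T
    perturbed_obs = obs[:, None] + noise
    return ensemble + gain @ (perturbed_obs - predicted)

# Synthetic example: 3 state variables, 50 members, observe the first variable.
rng = np.random.default_rng(42)
prior = rng.standard_normal((3, 50))
H = np.array([[1.0, 0.0, 0.0]])
R = np.array([[0.01]])
y = np.array([2.0])
posterior = enkf_analysis(prior, y, H, R, rng)
```

In this toy case the posterior ensemble mean of the observed variable moves toward the observation and its spread shrinks. Scaling this update to very large state vectors requires localization, inflation, and parallel decomposition, none of which this sketch attempts.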

In general the need for innovative data analysis tools – especially for larger data volumes (e.g., Big Data) – is an implicit theme underlying nearly all of NCAR’s strategic imperatives.

Specifically, however, this data science research supports NCAR’s first strategic imperative as it relates to basic research in data science and also to the fourth subsection of Imperative 4. Both of these imperatives relate to the challenge of tailoring data methods to meet the unique needs of the geosciences community. This research also fits into the community model development that is central to NCAR’s Imperative 3. This activity has the important role of confronting models with observations and fashioning data products that use the best statistical methods for model validation and diagnostics. Finally, we note that providing the best data tools for analysis, efficient workflow, and data reduction is aligned with the goals of Imperative 4.

Some highlights for data science research during this period include:

  • The Data Assimilation Research Testbed (DART) is a research facility for implementing new methods of data assimilation and for supporting geophysical research that depends on assimilation. The Manhattan release of DART during this period is significant because it allows the algorithms to scale to large state vectors. One important case is combining observations with the Parallel Ocean Program (POP) model at 1/10th-degree resolution.

  • Features from the NCAR Large Ensemble Projects have been successfully emulated using spatial statistics techniques. The emulation gives impact assessment models a richer distribution of climate variability than the model ensemble alone provides. The statistical modeling depends on new algorithmic approaches for representing Gaussian processes and on using the NCAR supercomputing environment for parallel data analysis.

  • A factorial experiment of 12 integrations (3 global models by 2 regional models by 2 resolutions) has been completed as part of the Coordinated Regional Climate Downscaling Experiment for North America (NA-CORDEX). Initial results indicate that for the U.S. Midwest the climate projections tend to be more sensitive to the global driving model than to the other factors. As further support for NA-CORDEX, some runs at higher resolution (12 km) have been completed, and observation bias correction has been applied to much of the NA-CORDEX archive.

  • Detecting climate change signals in observational data requires a careful statistical model to quantify the statistical significance of climate patterns, but such a model is difficult to compute using standard Bayesian methods. A dimension reduction technique reduces the large size of the climate fields, making the calculations manageable while also accounting for the uncertainty that the dimension reduction itself introduces.
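To make the dimension reduction idea in the last highlight more concrete, the sketch below compresses a set of synthetic fields onto a few leading principal components and reports how much variance the truncation keeps. This is a generic principal component truncation, chosen here only for illustration; the source does not say which technique was used, and the Bayesian treatment of truncation uncertainty is not reproduced. All names and data are hypothetical.

```python
import numpy as np

def reduce_fields(fields, n_components):
    """Project fields of shape (n_samples, n_gridpoints) onto leading patterns.

    Returns the scores, the orthonormal spatial patterns, the sample mean,
    and the fraction of total variance retained by the truncation.
    """
    mean = fields.mean(axis=0)
    centered = fields - mean
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    patterns = vt[:n_components]
    scores = centered @ patterns.T
    energy = singular_values ** 2
    retained = energy[:n_components].sum() / energy.sum()
    return scores, patterns, mean, retained

# Synthetic data: 40 fields on 500 grid points, built from two dominant
# patterns plus small-amplitude noise.
rng = np.random.default_rng(1)
true_patterns = rng.standard_normal((2, 500))
amplitudes = rng.standard_normal((40, 2)) * np.array([5.0, 3.0])
fields = amplitudes @ true_patterns + 0.1 * rng.standard_normal((40, 500))

scores, patterns, mean, retained = reduce_fields(fields, n_components=2)
residual = fields - (mean + scores @ patterns)  # what the truncation discards
```

The residual is the part of the data that the reduced representation ignores. A full analysis would carry a statistical description of that residual forward into the significance calculation rather than dropping it.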

Funding for these projects and related research is listed in each of the following sections.