Advance data-centric research

Grand challenges of modeling the Earth System require the interpretation and transformation of geophysical data in many forms. These activities range from mining the Big Data problems associated with large numerical experiments to interpreting the wide range of small but vital historical data sets that document past climate and important geophysical processes. Also of note are the massive data archives from remotely sensed and in situ instruments.

Accordingly, CISL takes a broad view of data, including traditional observations from instruments, outputs from models, and derived “data products” from analyses. In the past, HPC in the geosciences has focused on modeling; CISL research also examines how data transformation and data analysis benefit from HPC resources.

Nodes on sphere
This graphic shows three levels from a multi-resolution node distribution on the sphere, used for fitting remotely sensed CO2. The first three levels of nodes (green points) are used for a spatial statistics model on the sphere and a companion spatial analysis. This research represents the interplay, common in CISL, of developing new algorithms that exploit computational resources, address the specific needs of geophysical research, and are widely disseminated to the community as open-source software. The green points locate the centers of basis functions that can represent global fields and global data sets in a way that preserves spherical geometry. For reference, the equator and prime meridian (red) are drawn on each plot, and the shaded triangles indicate how the levels are related: each triangular face is subdivided into four smaller triangles by adding three new points. The basis functions in this model are bump-shaped functions centered on the node points that are zero beyond a certain range. As the number of levels increases, more complex structure of a field on the sphere can be represented, and the LatticeKrig spatial method is based on this scheme for assigning node points. The approach is efficient for interpolating and smoothing data because it takes advantage of sparse matrix algebra throughout its computations. Preliminary timing results for the sphere suggest that, for 10,000 observations, it can provide a factor of 10 or more speedup over standard methods.
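The triangle-subdivision scheme described in the caption can be sketched in a few lines of Python. This is an illustration only, not the LatticeKrig implementation: the starting icosahedron, the vertex ordering, and the function names are assumptions made for the sketch.

```python
import numpy as np

def icosahedron():
    """Vertices and triangular faces of a unit icosahedron (level-0 nodes)."""
    t = (1 + np.sqrt(5)) / 2
    v = np.array([(-1, t, 0), (1, t, 0), (-1, -t, 0), (1, -t, 0),
                  (0, -1, t), (0, 1, t), (0, -1, -t), (0, 1, -t),
                  (t, 0, -1), (t, 0, 1), (-t, 0, -1), (-t, 0, 1)], float)
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # project nodes onto the sphere
    f = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
         (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
         (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
         (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return v, f

def subdivide(verts, faces):
    """Split each triangle into four; the three new nodes are edge midpoints
    pushed out onto the sphere, so shared edges contribute one node only."""
    verts = list(map(tuple, verts))
    cache = {}                          # edge -> index of its midpoint node
    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            m = (np.array(verts[i]) + np.array(verts[j])) / 2
            verts.append(tuple(m / np.linalg.norm(m)))
            cache[key] = len(verts) - 1
        return cache[key]
    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return np.array(verts), new_faces
```

Each subdivision quadruples the face count, and the node counts grow 12, 42, 162, 642, …, which is how successive levels add the finer structure described above.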

CISL’s data-centric view, with its focus on high performance computing, results in research that integrates different aspects of computational and mathematical science. For example, our research on large data assimilation problems combines algorithms for ensemble representation (e.g., the ensemble Kalman filter) with statistical ideas for the robustness and stability of the methods. Making regional climate experiments useful for impacts analysis has led to combining ideas from fitting statistical distributions with the specific needs of objective bias correction of model output. The need for spatial statistics for large data sets has spurred approximations to standard Bayesian statistics that are suited to parallel computing. Finally, the research on data compression has involved blending “off the shelf” compression algorithms with the particular requirements and workflows encountered in climate model research. All of these areas require the additional research of developing algorithms and workflows that scale to the large parallel computational architectures available to the geosciences.
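As one concrete illustration, the ensemble Kalman filter analysis step mentioned above can be sketched in NumPy. This is a minimal perturbed-observation variant written for exposition; the function name, the dimensions, and the centering trick are assumptions of the sketch, not the implementation used in CISL's assimilation software.

```python
import numpy as np

def enkf_update(ens, y, H, R, rng):
    """Perturbed-observation ensemble Kalman filter analysis (illustrative).

    ens : (n_state, n_ens) forecast ensemble
    y   : (n_obs,) observation vector
    H   : (n_obs, n_state) linear observation operator
    R   : (n_obs, n_obs) observation-error covariance
    """
    n_ens = ens.shape[1]
    xbar = ens.mean(axis=1, keepdims=True)
    A = ens - xbar                                   # ensemble anomalies
    Pf = A @ A.T / (n_ens - 1)                       # sample forecast covariance
    K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)   # Kalman gain
    # Perturbed observations, centered so the ensemble-mean update is exact
    eps = rng.multivariate_normal(np.zeros(len(y)), R, size=n_ens).T
    eps -= eps.mean(axis=1, keepdims=True)
    Y = y[:, None] + eps
    return ens + K @ (Y - H @ ens)                   # analysis ensemble
```

The update pulls the ensemble mean toward the observation and shrinks the spread in the observed directions, which is the behavior the statistical robustness work builds on.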

In general, the need for innovative data analysis tools – especially for larger data volumes (e.g., Big Data) – is an implicit theme underlying nearly all of NCAR’s strategic imperatives.

Specifically, however, this data science research supports NCAR’s first strategic imperative as it relates to basic research in data science, and also the fourth subsection of Imperative 4. Both of these imperatives relate to the challenge of tailoring data methods to meet the unique needs of the geosciences community. This research also fits into the community model development that is central to NCAR’s Imperative 3. This activity has the important role of confronting models with observations and fashioning data products that use the best statistical methods for model validation and diagnostics. Finally, we note that providing the best data tools for analysis, efficient workflow, and data reduction is aligned with the goals of Imperative 4.

Some highlights for data science research during this period include:

  • The Data Assimilation Research Testbed has been improved to be more scalable in terms of memory. This allows for ensemble data assimilation for models with larger state vectors and is important to address assimilation for higher-resolution versions of weather models such as WRF and MPAS, and for coupled climate models such as CESM.

  • The basic statistical computations needed for analyzing spatial data have been optimized to use multiple GPUs and processors. This parallelization can give a factor-of-70 speedup for critical computational steps, and it supports a useful increase in the size of data sets that can be analyzed using standard methods.

  • A parallel algorithm has been devised to address the statistical computations for finding signals of climate change in observational data. A novel feature of this work is to remove some of the arbitrariness of deciding how many basis functions (i.e., empirical orthogonal functions) are used to represent the climate signal.

  • Initial experiments have been completed as part of the North American portion of the Coordinated Regional Climate Downscaling Experiment (NA-CORDEX). Runs exploring robustness to different regional resolutions (25 km versus 50 km) suggest that the differences between the 50-km simulations driven by the two different global models are as great as the differences between the resolutions when a regional model (RegCM4) was driven by the MPI global model. This kind of factorial-designed experiment helps to set ranges on the uncertainty of regional climate projections based on a multi-model ensemble.
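The empirical orthogonal functions used in the climate-change detection work above can be sketched via the singular value decomposition of an anomaly matrix. This is an illustrative reimplementation under simple assumptions (a dense `(time, space)` array, no area weighting); the parallel algorithms developed in the actual research differ.

```python
import numpy as np

def eofs(field, n_modes=None):
    """EOF decomposition of a (time, space) data matrix via the SVD (sketch).

    Returns (patterns, pcs, frac): spatial patterns (one per row), principal
    component time series, and the fraction of variance each mode explains.
    Truncating to n_modes is the step whose choice the research above
    seeks to make less arbitrary.
    """
    X = field - field.mean(axis=0)          # remove the time mean at each location
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    frac = s**2 / np.sum(s**2)              # explained-variance fractions
    if n_modes is not None:
        U, s, Vt, frac = U[:, :n_modes], s[:n_modes], Vt[:n_modes], frac[:n_modes]
    return Vt, U * s, frac
```

With all modes retained, the principal components times the patterns reconstruct the anomalies exactly, and the explained-variance fractions are nonincreasing and sum to one.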

Funding for these projects and related research is listed in each of the following sections.

CO2 obs data
These two figures show the raw CO2 observations (left), typical of a remotely sensed data product, and the estimated complete surface (right) using a spatial method based on the LatticeKrig model. Approximately 32,000 observations and 13,000 basis functions at four levels of resolution (3–6) are used in the statistical model.