Accelerating data analysis

NSF and other agencies now broadly recognize Big Data as a paramount challenge across science and engineering. Models such as the Community Earth System Model (CESM) have improved dramatically in performance, and the volume of data they produce has grown accordingly. Our processing, analysis, and visualization tools have not kept pace: they are generally single-threaded and sometimes limited to 32-bit addressing. In addition to addressing the hardware cyberinfrastructure (CI) side of the data analysis problem with NWSC resources, CISL is engaged in several activities that explore the requirements and develop new strategies for the software CI side.

CISL was heavily involved in the CGD-led process of preparing CMIP5 data for publication into the Earth System Grid for subsequent community use. The existing tools and workflows are fundamentally serial and seriously inadequate. CISL addressed these deficiencies in collaboration with CGD by developing several Python-based parallel post-processing tools and releasing them to the CESM community. A joint project among CGD, IMAGe, and ASAP developed a method for evaluating lossy data compression that analyzes the internal variability of a large ensemble of CESM runs.

We have released compressed climate data to the entire climate community by augmenting the CESM-CAM5 Large Ensemble (CESM-LE) community project with several ensemble members that have been compressed at a 5-to-1 ratio. We are actively soliciting feedback from the climate modeling community as to whether the effects of lossy compression are distinguishable from the natural variability of the ensemble.
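
The idea of judging lossy compression against ensemble variability can be illustrated with a small sketch. It is not the project's actual method or code: the uniform quantizer stands in for a real lossy compressor, the root-mean-square z-score (RMSZ) is an assumed choice of metric, and the function names (quantize, rmsz, leave_one_out_rmsz, check_member) are hypothetical. Each original member is scored against the remaining members to establish the range produced by natural variability, and a reconstructed member is then checked against that range.

```python
import numpy as np


def quantize(field, step):
    """Stand-in lossy compressor (assumption): round values to a grid of width `step`."""
    return np.round(field / step) * step


def rmsz(member, others):
    """Root-mean-square z-score of one field against a set of other members."""
    mean = others.mean(axis=0)
    std = others.std(axis=0, ddof=1)  # sample standard deviation per grid point
    return float(np.sqrt(np.mean(((member - mean) / std) ** 2)))


def leave_one_out_rmsz(ensemble):
    """Score every original member against all of the other members."""
    return np.array([
        rmsz(ensemble[i], np.delete(ensemble, i, axis=0))
        for i in range(len(ensemble))
    ])


def check_member(ensemble, index, step):
    """Compress one member and test whether its score stays in the ensemble's range.

    Returns (within_range, reconstructed_score, original_scores).
    """
    scores = leave_one_out_rmsz(ensemble)
    reconstructed = quantize(ensemble[index], step)
    others = np.delete(ensemble, index, axis=0)
    score = rmsz(reconstructed, others)
    within = bool(scores.min() <= score <= scores.max())
    return within, score, scores


if __name__ == "__main__":
    # Synthetic stand-in for an ensemble: 30 members, 500 grid points each.
    rng = np.random.default_rng(0)
    ensemble = rng.normal(size=(30, 500))
    for step in (0.01, 50.0):
        within, score, scores = check_member(ensemble, index=0, step=step)
        print(f"step={step}: score={score:.3f}, "
              f"ensemble range=[{scores.min():.3f}, {scores.max():.3f}], "
              f"within={within}")
```

In this sketch a reconstructed member whose score falls inside the range spanned by the original members is treated as indistinguishable from natural variability. A real evaluation would use actual model output, a real compressor, and more than one metric.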

CMIP5 and data services research and development are supported by NSF Core funds. The new work in model data processing is supported by NSF special award AGS-0856145.