Evaluate data compression for scientific data

Tornado wind velocity
Renderings of isosurfaces of the z-component of velocity from a numerically simulated tornado. The left figures show the entire data set, while the right figures are enlargements of one feature. Dashed lines in the left figures indicate the enlarged region. The top row is from the original (raw) data; the middle row is from 128:1 compressed data using spatial domain compression only; and the bottom row is from 128:1 compressed data using both spatial and temporal compression.

Due to a combination of factors that include both diverging rates of technological advancement and cost, our ability to compute data has outstripped our ability to effectively store, manage, and analyze it. This trend is expected to continue for the foreseeable future as evidenced, for example, by specifications for exascale computing platforms that are anticipated to arrive near the end of the decade. Computational scientists, such as those studying climate, whose research depends on ever-increasing model resolution and complexity for improved understanding are facing daunting challenges posed by both limited storage capacity and I/O bandwidth. In response to these issues, CISL is exploring a variety of novel hardware- and software-based solutions.

Two related research areas with high potential impact are lossy data compression and progressive data access. Lossy compression, unlike lossless compression, does not reproduce the original data exactly, but it can achieve far greater reduction than lossless techniques when applied to floating-point data. Even so, the reconstructed results may be indistinguishable from the original by many salient metrics. Progressive data access, on the other hand, reconstructs the original data in stages, which enables an investigator to make speed-quality tradeoffs that can be subsequently validated, as needed, with lossless reconstruction. The goals of this work are to:

  • Determine whether, and to what degree, scientific data sets can tolerate information loss

  • Investigate a variety of compression methods and their suitability for geoscience data

  • Develop user tools for data compression and improved, more general, progressive data access
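To make the lossy/lossless distinction above concrete, the following is a minimal sketch, not one of the compression methods studied in this work. It assumes a synthetic field (`field`), uses zlib as a generic lossless compressor, and uses mantissa truncation (the hypothetical helper `truncate_mantissa`) as a simple stand-in for a lossy step. It compares the compressed sizes and reports the error that the lossy step introduces.

```python
import math
import struct
import zlib

def as_float32(value):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def truncate_mantissa(value, keep_bits):
    """Zero the low (23 - keep_bits) mantissa bits of a float32 value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = 0xFFFFFFFF ^ ((1 << (23 - keep_bits)) - 1)
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

def deflate_size(values):
    """Size in bytes of the values stored as float32 and compressed with zlib."""
    return len(zlib.compress(struct.pack(f"<{len(values)}f", *values), 9))

# Synthetic smooth field: a hypothetical stand-in for model output.
field = [as_float32(math.sin(0.01 * i) + 0.1 * math.cos(0.37 * i))
         for i in range(10000)]
# Lossy step (assumption for illustration): keep only 7 of 23 mantissa bits.
lossy = [truncate_mantissa(v, keep_bits=7) for v in field]

raw_size = 4 * len(field)
lossless_size = deflate_size(field)
lossy_size = deflate_size(lossy)
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(field, lossy) if a != 0.0)

print(f"raw {raw_size} B, lossless {lossless_size} B, lossy {lossy_size} B")
print(f"max relative error after truncation: {max_rel_err:.2e}")
```

Truncating to 7 mantissa bits bounds the relative error of each normal float32 value below 2^-7, which is the kind of explicit size-versus-accuracy tradeoff that lossy methods expose.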

This work is aimed at ensuring that NCAR’s substantial investments in HPC resources produce significant returns. The work directly supports CISL’s strategic plans in advanced applied computational science research calling for “pursuing novel, aggressive data compression techniques that have the potential to substantially reduce the storage and bandwidth needed for numerical experiments.” The work also aligns well with strategic plan goals for providing the NCAR user community with Big Data services aimed at analyzing a variety of data sets, and goals in the area of advancing data-centric research by exploring and developing visualization approaches that can handle the very large volumes of data that are increasingly common in the geosciences.

In FY2016, CISL researchers continued earlier efforts to investigate applying lossy data compression to CESM output. In particular, this work builds on a 2014 paper by Baker et al. that introduces a methodology for evaluating the impact of compression on climate simulation data and examines a number of lossy compression techniques within this framework. The key idea in quantifying the impact of compression is that, to preserve the integrity of the climate simulation data, the effects of lossy data compression on the original data should, at a minimum, not be statistically distinguishable from the natural variability of the climate system. The preliminary work with data from CESM in Baker et al., 2014 indicates that this goal is attainable.
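One way to picture the "not statistically distinguishable from natural variability" criterion is an ensemble-based check. The sketch below is an illustration under stated assumptions, not the published methodology: it assumes a synthetic ensemble, uses a root-mean-square z-score (the hypothetical helper `rmsz`) as the comparison metric, and uses simple quantization (`quantize`) as a stand-in for a lossy compress/reconstruct round trip. A reconstructed member passes if its score falls within the range spanned by the original members.

```python
import math
import random
import statistics

def rmsz(member, others):
    """Root-mean-square z-score of one member against the rest, point by point."""
    total = 0.0
    for i, x in enumerate(member):
        column = [m[i] for m in others]
        z = (x - statistics.fmean(column)) / statistics.stdev(column)
        total += z * z
    return math.sqrt(total / len(member))

def baseline_scores(ensemble):
    """RMSZ of every original member against all the other members."""
    return [rmsz(m, ensemble[:k] + ensemble[k + 1:]) for k, m in enumerate(ensemble)]

def is_indistinguishable(ensemble, baseline, index, reconstructed):
    """True if the reconstructed member's RMSZ lies within the ensemble's range."""
    others = ensemble[:index] + ensemble[index + 1:]
    score = rmsz(reconstructed, others)
    return min(baseline) <= score <= max(baseline), score

def quantize(values, step):
    """Hypothetical stand-in for a lossy compress/reconstruct round trip."""
    return [round(v / step) * step for v in values]

# Synthetic ensemble: shared signal plus independent unit-variance "variability".
rng = random.Random(0)
ensemble = [[10 * math.sin(0.05 * i) + rng.gauss(0, 1) for i in range(200)]
            for _ in range(20)]
baseline = baseline_scores(ensemble)
# Test the member with the median score so the demo is not an edge case.
index = sorted(range(len(baseline)), key=baseline.__getitem__)[len(baseline) // 2]

for step in (1e-6, 8.0):
    ok, score = is_indistinguishable(ensemble, baseline, index,
                                     quantize(ensemble[index], step))
    print(f"step={step:g}: RMSZ={score:.3f}, within ensemble range: {ok}")
```

A fine quantization step leaves the member's score essentially unchanged, while a step much larger than the ensemble spread pushes the score outside the range of the original members.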

Therefore, our path forward has included both giving climate scientists direct experience with climate data that has undergone lossy compression and extending the suite of previously explored compression strategies to include state-of-the-art wavelet encoders. In the former case, we conducted a blind experiment that engaged the wider climate community to evaluate the impact of lossy data compression on publicly available climate data from the CESM Large Ensemble Community Project. Climate scientists examined features of the data relevant to their interests and attempted to identify which of the ensemble members had been compressed and reconstructed. Overall, we found that while detecting distinguishing features is certainly possible, the compression effects noticeable in these features are often unimportant or disappear in post-processing analyses. While the original goal of this study was to convince climate scientists that using lossy data compression is both acceptable in terms of effects on scientific results and advantageous in terms of data reduction, the feedback that we received also proved invaluable for informing future compression error metrics. A paper detailing this work has been accepted for publication in Geoscientific Model Development.

In the area of wavelet-based encoders, preliminary results suggest that these encoders compare favorably with the best compression techniques investigated previously. However, the story is complicated: different compression techniques appear to work better for different data variables. Therefore, this ongoing work involves expanding our original analysis of lossy compression in CESM to better understand the strengths and weaknesses of different varieties of compression algorithms (e.g., transform vs. predictive), and plans call for a follow-up paper to Baker et al., 2014 later this year. Ultimately the goal is to determine the most effective compression approach for each variable based on measurable properties of that variable.
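As a rough illustration of the transform-based class of methods mentioned above, the sketch below applies a multi-level Haar wavelet transform and keeps only the largest-magnitude coefficients. The Haar wavelet and the `keep_largest` selection rule are assumptions chosen for brevity; they are not the state-of-the-art encoders evaluated in this work, which also quantize and entropy-code the coefficients.

```python
import math

S = 1 / math.sqrt(2)

def haar_forward(signal):
    """Multi-level orthonormal Haar transform; len(signal) must be a power of two."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        approx = [(coeffs[2 * i] + coeffs[2 * i + 1]) * S for i in range(half)]
        detail = [(coeffs[2 * i] - coeffs[2 * i + 1]) * S for i in range(half)]
        coeffs[:n] = approx + detail  # coarse part first, details after it
        n = half
    return coeffs

def haar_inverse(coeffs):
    """Invert haar_forward, rebuilding from the coarsest level outward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [(a + d) * S, (a - d) * S]
        out[:2 * n] = merged
        n *= 2
    return out

def keep_largest(coeffs, keep):
    """Zero all but the `keep` largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Synthetic smooth signal (hypothetical stand-in for one variable).
signal = [math.sin(2 * math.pi * i / 256) + 0.3 * math.sin(2 * math.pi * 5 * i / 256)
          for i in range(256)]
coeffs = haar_forward(signal)
for keep in (256, 32, 8):
    recon = haar_inverse(keep_largest(coeffs, keep))
    print(f"keep {keep:3d}/256 coefficients: RMSE = {rmse(signal, recon):.4f}")
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, so keeping more of the largest coefficients can never increase the error. How quickly the error falls depends on the data, which is consistent with the observation that different techniques suit different variables.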

CISL researchers also explored spatio-temporal compression of scientific data using wavelets. This research exploits the coherence that exists in both the spatial and temporal domains of many numerical simulations. The results of this work, conducted in collaboration with the University of Oregon, led to a manuscript that has been submitted for review to IEEE Transactions on Visualization and Computer Graphics.
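The benefit of exploiting temporal as well as spatial coherence can be sketched with a toy example. The code below is an assumption-laden illustration, not the method from the submitted manuscript: it uses a single Haar level per axis, a synthetic slowly drifting wave (`frames`), and an arbitrary significance threshold. It counts how many coefficients exceed the threshold when only the spatial axis is transformed and when the time axis is transformed as well.

```python
import math

S = 1 / math.sqrt(2)

def haar_step(values):
    """One orthonormal Haar level: pairwise sums, then pairwise differences."""
    pairs = list(zip(values[0::2], values[1::2]))
    return [(a + b) * S for a, b in pairs] + [(a - b) * S for a, b in pairs]

def spatial_only(frames):
    """Transform each time step independently along the spatial axis."""
    return [haar_step(frame) for frame in frames]

def spatial_then_temporal(frames):
    """Transform along space, then along time at every spatial index."""
    rows = spatial_only(frames)
    columns = [haar_step([row[x] for row in rows]) for x in range(len(rows[0]))]
    return [[col[t] for col in columns] for t in range(len(rows))]

def count_significant(coeffs, threshold):
    """Number of coefficients a thresholding encoder would have to keep."""
    return sum(1 for row in coeffs for c in row if abs(c) > threshold)

# Synthetic slowly drifting wave: 16 time steps of 64 samples (hypothetical data).
frames = [[math.sin(0.2 * x - 0.05 * t) for x in range(64)] for t in range(16)]
threshold = 0.05  # arbitrary cutoff chosen for illustration

n_space = count_significant(spatial_only(frames), threshold)
n_space_time = count_significant(spatial_then_temporal(frames), threshold)
print(f"significant coefficients: spatial only {n_space}, "
      f"spatial+temporal {n_space_time}")
```

When consecutive time steps are similar, the temporal differences are small, so fewer coefficients remain above the threshold for the same total number of coefficients.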

CISL also completed work funded by an NSF SI2 grant – the Wavelet enabled Storage Access Protocol (WASP) – to develop a scientific file format, based on NetCDF, capable of supporting both lossy compression and progressive data access. The new WASP file format is the cornerstone of the VAPOR Data Collection, Version 3, and has also been adopted by the University of California at San Diego for its bio-imaging analysis platform, QUEST.
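The idea of progressive data access can be sketched as a coarse-to-fine layout in which a reader fetches only a prefix of the stored data. The code below is a minimal illustration and does not reflect the WASP file format or API: the functions `encode` and `decode` and the chunk layout are hypothetical, and a simple average/difference pyramid stands in for the wavelet transform.

```python
import math

def encode(signal):
    """Return chunks ordered coarse to fine: [mean], then detail levels.

    len(signal) must be a power of two.
    """
    chunks = []
    current = list(signal)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        chunks.append([(a - b) / 2 for a, b in pairs])   # details at this level
        current = [(a + b) / 2 for a, b in pairs]        # coarser averages
    chunks.append(current)
    chunks.reverse()
    return chunks

def decode(chunks, n_chunks):
    """Reconstruct using only the first n_chunks chunks (missing details are zero)."""
    current = list(chunks[0])
    for level in range(1, len(chunks)):
        details = chunks[level] if level < n_chunks else [0.0] * len(current)
        refined = []
        for a, d in zip(current, details):
            refined += [a + d, a - d]
        current = refined
    return current

# Synthetic signal (hypothetical stand-in for a stored variable).
signal = [math.sin(2 * math.pi * i / 64) + 0.2 * math.cos(2 * math.pi * 7 * i / 64)
          for i in range(64)]
chunks = encode(signal)

errors = []
for n in range(1, len(chunks) + 1):
    approx = decode(chunks, n)
    errors.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(signal, approx))
                            / len(signal)))
    values_read = sum(len(c) for c in chunks[:n])
    print(f"read {values_read:2d}/{len(signal)} values: RMSE = {errors[-1]:.4f}")
```

Each additional chunk refines the reconstruction, and reading every chunk recovers the original values to within floating-point rounding, which mirrors the speed-quality tradeoff that progressive access is meant to offer.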

CISL’s scientific data compression research is supported by NSF Core funds. Development of the WASP API was funded via a subaward from the University of California at San Diego, NSF grant 54067252.