Evaluate data compression for scientific data

Compression comparison
Horizontal contour plot of DJF (December–January–February) means for surface pressure (PS) from 2006 to 2099 for ensemble member 31 from the CESM Large Ensemble Community Project. The data in the top subplot have undergone lossy compression (i.e., 31-C), and the bottom subplot contains the original data.

Earth System Model (ESM) simulations generate large data volumes, and retaining the data from these simulations often strains institutional storage resources and budgets. In fact, high-resolution simulations are increasingly constrained by the amount of data that they generate rather than by computational resources. For example, limited storage capacity and I/O bandwidth negatively impact science objectives by forcing reductions in data output frequency, model resolution, simulation length, or ensemble size. CISL is exploring various ways to reduce climate simulation data volumes, including the use of data compression techniques on data from the CESM.

Lossy data compression has the potential to significantly reduce data volumes at NCAR. Unlike lossless compression, lossy compression does not exactly reproduce the original data, but it is capable of far greater data reduction than lossless techniques when applied to floating-point data. CISL researchers have studied the impact of lossy data compression on a variety of ESM data sets, including publicly available climate data, by collaborating with climate scientists and requesting their feedback. This study showed that while compression effects can be detected in features important to climate scientists, these differences are often unimportant or disappear in post-processing analyses. The study was conducted using one specific compression method, but many lossy compression techniques are available, each with different strengths and weaknesses, and determining the optimal lossy compression methods for a wide variety of geoscience data is critical. The ultimate goal of this work is to be able to apply lossy compression to CESM data in a manner that both maximizes data reduction and preserves the scientific value of the data. To facilitate this task, an automated tool will be developed to determine the best compression method for each variable and to verify that it meets all quality metrics, so that we can ultimately integrate lossy compression into the automated CESM workflow.
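The difference between lossless and lossy compression of floating-point data can be illustrated with a minimal sketch. It is not the compression method used in the study: the synthetic field, the choice of mantissa truncation as the lossy step, and the names make_field, truncate_mantissa, and keep_bits are all assumptions made for illustration. The sketch packs a pressure-like field as 32-bit floats, compresses it losslessly with zlib, and then compresses a copy in which the low mantissa bits have been zeroed.

```python
import math
import random
import struct
import zlib


def make_field(n=20000, seed=0):
    """Synthetic smooth-plus-noise field standing in for a model variable (assumed data)."""
    rng = random.Random(seed)
    return [101325.0 + 500.0 * math.sin(i / 200.0) + rng.gauss(0.0, 5.0)
            for i in range(n)]


def pack_float32(values):
    """Store the values as little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def truncate_mantissa(values, keep_bits):
    """Zero the low (23 - keep_bits) mantissa bits of each float32 value.

    This is the lossy step: the discarded bits cannot be recovered.
    """
    drop = 23 - keep_bits
    mask = (0xFFFFFFFF >> drop) << drop
    words = struct.unpack(f"<{len(values)}I", pack_float32(values))
    return struct.pack(f"<{len(words)}I", *(w & mask for w in words))


field = make_field()
original = pack_float32(field)
truncated = truncate_mantissa(field, keep_bits=10)

# Lossless: zlib alone, fully reversible.
lossless_size = len(zlib.compress(original, 9))
# Lossy: truncation followed by the same lossless coder.
lossy_size = len(zlib.compress(truncated, 9))

orig_vals = struct.unpack(f"<{len(field)}f", original)
lossy_vals = struct.unpack(f"<{len(field)}f", truncated)
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(orig_vals, lossy_vals))

print(f"raw bytes:        {len(original)}")
print(f"lossless bytes:   {lossless_size}")
print(f"lossy bytes:      {lossy_size}")
print(f"max relative err: {max_rel_err:.2e}")
```

Keeping 10 mantissa bits bounds the relative error of each value below 2^-10, and the zeroed bits give the lossless coder far more redundancy to exploit. Whether an error of that size is acceptable depends on the variable and the analysis, which is the question the work described above addresses.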

This data reduction effort is aimed at ensuring that NCAR’s substantial investments in HPC resources produce significant returns. The work directly supports CISL’s strategic plans in advanced applied computational science research calling for “pursuing novel, aggressive data compression techniques that have the potential to substantially reduce the storage and bandwidth needed for numerical experiments.” The work also aligns well with strategic plan goals for providing the NCAR user community with Big Data services aimed at analyzing a variety of data sets, including goals in the area of advancing data-centric research by exploring and developing visualization approaches that can handle the very large volumes of data that are increasingly common in the geosciences.

In FY2017 CISL researchers continued to carefully evaluate different types of lossy compression approaches on CESM data. Because care must be taken to ensure that science results are not impacted, choosing appropriate lossy compression algorithms and parameters is not trivial given the diversity of data produced by CESM. In particular, an understanding of both the attributes of the data and the properties of the chosen compression methods is needed. A paper by Baker et al. that discusses the properties of two distinct approaches for lossy compression (transform and predictive) in the context of CESM, and that demonstrates the different strengths of each, was accepted to the first International Workshop on Data Reduction for Big Scientific Data (and will appear in the Lecture Notes in Computer Science series). This work motivates the development of an automated multi-method approach for compression of climate model output. In particular, we are actively working on techniques to predict the compression method and level based on measurable CESM variable features, and we plan to submit a paper on that research in early FY2018.
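Why compression parameters need to be chosen per variable can be shown with a small sketch. It uses a brute-force search rather than the feature-based prediction described above, and everything in it is an assumption for illustration: the two synthetic variables, mantissa truncation as the lossy method, the two quality metrics (Pearson correlation and range-normalized maximum error), and their thresholds are hypothetical and are not the metrics used by CISL.

```python
import math
import random
import struct
import zlib


def to_float32(values):
    """Round values to float32 precision, as they would be stored on disk."""
    n = len(values)
    return list(struct.unpack(f"<{n}f", struct.pack(f"<{n}f", *values)))


def quantize(values, keep_bits):
    """Lossy step (assumed method): keep only `keep_bits` mantissa bits."""
    n = len(values)
    drop = 23 - keep_bits
    mask = (0xFFFFFFFF >> drop) << drop
    words = struct.unpack(f"<{n}I", struct.pack(f"<{n}f", *values))
    packed = struct.pack(f"<{n}I", *(w & mask for w in words))
    return packed, list(struct.unpack(f"<{n}f", packed))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(vx * vy)


def normalized_max_error(xs, ys):
    """Largest pointwise error divided by the range of the original data."""
    span = max(xs) - min(xs)
    return max(abs(x - y) for x, y in zip(xs, ys)) / span


def choose_level(values, min_corr=0.99999, max_nerr=1e-3):
    """Return the fewest mantissa bits that pass both (hypothetical) checks,
    together with the resulting compression ratio."""
    ref = to_float32(values)
    raw_size = 4 * len(values)
    for keep_bits in range(1, 24):
        packed, recon = quantize(values, keep_bits)
        if (pearson(ref, recon) >= min_corr
                and normalized_max_error(ref, recon) <= max_nerr):
            return keep_bits, raw_size / len(zlib.compress(packed, 9))
    # Fall back to full precision (only reached for degenerate fields).
    packed, _ = quantize(values, 23)
    return 23, raw_size / len(zlib.compress(packed, 9))


rng = random.Random(1)
# Two synthetic stand-ins with different magnitudes and ranges.
ps = [101325.0 + 500.0 * math.sin(i / 100.0) + rng.gauss(0.0, 5.0) for i in range(5000)]
ts = [288.0 + 10.0 * math.sin(i / 100.0) + rng.gauss(0.0, 0.5) for i in range(5000)]

ps_bits, ps_ratio = choose_level(ps)
ts_bits, ts_ratio = choose_level(ts)
print(f"PS-like: keep {ps_bits} bits, ratio {ps_ratio:.2f}")
print(f"TS-like: keep {ts_bits} bits, ratio {ts_ratio:.2f}")
```

The search tries each candidate level for every variable, which is simple but expensive. Predicting the method and level from measurable features of a variable, as proposed above, is meant to avoid that cost.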

Our previous work in providing climate scientists with direct experience with climate data that has undergone lossy compression was a successful step toward incorporating compression into the CESM workflow. Ideally, data should initially be written to disk in compressed form, either losslessly or lossily. To this end, collaborative efforts began this year to make more compression algorithms (lossy, in particular) accessible, e.g., via NetCDF filters, so that reading and writing compressed data is easier. Many hurdles exist, such as writing in parallel with compression, but collaborations with Unidata and CISL’s pyNIO developers are important steps forward.

CISL researchers also secured a third year of funding from an NSF SI2 grant that was used to enhance NCAR Command Language (NCL) support for reading and writing WASP data files. WASP is a lightweight wrapper around the NetCDF API that defines a protocol for storing lossy data in NetCDF files. The NCL extensions are aimed at facilitating the inclusion of data compression in existing workflows. In FY2018 we will work with NCAR domain scientists to evaluate the new capability.

CISL researchers also implemented wavelet-based compression on GPUs and evaluated its performance. The work led to a publication by Li et al.

CISL’s scientific data compression research is supported by NSF Core funds. Work on NCL and the WASP API was funded via a subaward from the University of California at San Diego, NSF grant 54067252.