Scientific data compression and visualizing large datasets

CISL is exploring a variety of hardware- and software-based approaches for addressing the challenges of storing, visualizing, and analyzing large data sets. On the software side, the work follows two interrelated approaches. The first has been research, development, and experimentation with wavelet-based progressive-access data models for structured scientific data sets. Wavelets are the basis for numerous, ubiquitous multimedia compression technologies such as the JPEG 2000 image compression standard. However, unlike the “lossy” compression strategies used in consumer entertainment, our efforts are focused on level-of-detail techniques that offer perfect reconstruction of the original data while allowing the user to make speed/quality tradeoffs when performing interactive work. The second approach is to develop a method for determining how much information can be lost without impacting the results of typical climate analysis. The goals of all this work are to:

  • Determine whether, and to what degree, scientific data sets can tolerate information loss.

  • Investigate a variety of compression methods and their suitability for geoscience data.

  • Develop user tools for data compression and improved, more general, progressive data access.
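
The level-of-detail idea behind wavelet-based progressive access can be illustrated with a minimal sketch. It uses the simple Haar wavelet as a stand-in, which is an assumption made for brevity and not necessarily the transform used in CISL's software; the function names are hypothetical. The signal is stored as one overall mean plus bands of detail coefficients ordered from coarse to fine. Reading only the first few bands yields a lower-resolution approximation, and reading all of them recovers the original (exactly for the small integer values used here, and up to floating-point rounding in general).

```python
def haar_forward(signal):
    """Split a list whose length is a power of two into its overall mean
    and a list of detail bands ordered from coarsest to finest."""
    approx = list(signal)
    details = []  # collected finest band first, reversed on return
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    return approx[0], details[::-1]


def haar_inverse(mean, details, levels=None):
    """Rebuild the signal from the mean and the first `levels` detail bands.
    Using fewer bands returns a coarser, shorter approximation."""
    if levels is None:
        levels = len(details)
    approx = [mean]
    for band in details[:levels]:
        # Each (average, detail) pair expands back into two samples.
        approx = [v for a, d in zip(approx, band) for v in (a + d, a - d)]
    return approx


# Hypothetical 8-sample data set.
data = [4, 6, 10, 12, 8, 6, 5, 5]
mean, details = haar_forward(data)

for levels in range(len(details) + 1):
    print(levels, haar_inverse(mean, details, levels))
```

Each additional band doubles the resolution of the result, which is the speed/quality tradeoff in miniature: an interactive tool can display a coarse approximation immediately and refine it as more coefficients are read.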

Exponential growth in transistor density is producing ongoing increases in computer processing power. These increases enable computational scientists to create numerical simulations of physical phenomena at unprecedented scales, thus generating extraordinary amounts of data. Yet while microprocessor performance continues to advance in accordance with Moore’s Law, other computing technologies are improving at much more modest rates. In particular, storage and networking bandwidths have lagged behind. As a result, the challenge of storing, analyzing, managing, and sharing large simulation data sets is becoming ever more problematic. Moreover, visualization of large data is a central component of petascale computing and of making large, heterogeneous data sets understandable. This is a science frontier identified in CISL’s strategic plan, which specifies these tasks:

  • Prepare for petascale and exascale computing.

  • Partner with peer institutions and combine efforts to develop and enable visualization of large data.

Data compression analysis
This table compares the LMax and RMSE errors between original and compressed double-precision data for three compression methods: truncation to 32 bits, the SPIHT wavelet encoder, and the SPECK wavelet encoder. Truncation of double-precision data to single precision is used to save space and is a standard practice among numerical modelers. The results suggest that, at the 2:1 compression rate obtained by truncation, SPIHT or SPECK could retain more information than truncation does, or that more aggressive compression with SPIHT or SPECK could retain the same amount of information as truncation.
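
As a rough illustration of the two error measures, the sketch below computes them for the truncation case only. It assumes that LMax denotes the largest absolute pointwise difference (the L-infinity norm of the error) and that RMSE is the usual root-mean-square error; the test field and the function names are hypothetical. The conversion to single precision here rounds to the nearest representable value, which is how such "truncation" is normally carried out in practice.

```python
import math
import struct


def to_single_precision(values):
    """Round each double to the nearest IEEE 754 single-precision value."""
    return [struct.unpack("f", struct.pack("f", v))[0] for v in values]


def lmax_error(original, approx):
    """Largest absolute pointwise difference (assumed meaning of LMax)."""
    return max(abs(a - b) for a, b in zip(original, approx))


def rmse(original, approx):
    """Root-mean-square error between two equal-length sequences."""
    squared = sum((a - b) ** 2 for a, b in zip(original, approx))
    return math.sqrt(squared / len(original))


# Hypothetical stand-in for a double-precision model field.
field = [300.0 * math.sin(0.01 * i) for i in range(1000)]
single = to_single_precision(field)

print("LMax:", lmax_error(field, single))
print("RMSE:", rmse(field, single))
```

The same two functions could be applied to the output of any other compressor, which is what makes a like-for-like comparison against truncation possible.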

In FY2015, CISL hosted Samuel Li, a Ph.D. student from the University of Oregon, for a two-month visit to NCAR. Samuel’s thesis work, under Prof. Hank Childs, focuses on a variety of aspects of scientific data compression. While at NCAR, Samuel continued efforts begun last year by CISL staff to objectively evaluate the state-of-the-art wavelet encoders SPIHT and SPECK. Some of the most significant findings include:

  • When applied to double-precision data, these encoders preserve information better than simple truncation to single precision (widely used in practice by modelers of all stripes) while achieving compression factors of 6x to 8x.

  • The encoders also appear to outperform (in terms of distortion rates) all of the compression strategies applied to climate simulation data evaluated in the 2013 CISL paper by Baker et al.

The promising results of this brief summer internship are expected to lead to a follow-up to the Baker et al. paper in FY2016, as well as possible integration of the SPECK encoder into CISL’s VAPOR package.

This data compression and visualization research is supported by NSF Core funds, subaward 54067252 from the University of California at San Diego, and KISTI grant C15012.