Parallel data processing

Parallel method speedup
This figure compares the previous serial post-processing method with the new parallel method. The blue bars show the time to calculate temporal averages using the previous netCDF Operators (NCO) workflow on a single core of Caldera; the red bars show pyAverager performing the same tasks on 120 cores of Yellowstone.

The parallel data processing project is critical to the objectives of NCAR for several reasons. While the rate at which the CESM can generate simulation data has increased rapidly over the last five years, our ability to analyze those data has not, owing to the serial nature of the post-processing workflow. The impact of this serial workflow became apparent during the CMIP5 project, when post-processing took as long to complete as the initial simulations. This project will increase the rate of scientific discovery by removing post-processing bottlenecks.

In FY2015, members of the Application Scalability and Performance (ASAP) Group, in collaboration with NCAR’s Climate and Global Dynamics (CGD) Laboratory, completed the parallelization of a significant portion of the CESM post-processing workflow, including two disk-I/O-intensive calculations and the generation of diagnostic plots. The PyReshaper tool converts data from time-slice to time-series format, while pyAverager performs a number of climatologically important temporal averages. Both applications are written in Python, use the Message Passing Interface (MPI) for parallelism, and rely on PyNIO (the I/O library from NCL) to access files on disk. Both PyReshaper and pyAverager have been adopted by CGD and will be included in future CESM releases. The component-specific diagnostic packages that scientists use to evaluate the fidelity of simulations were also redesigned so that the existing NCL-based scripts can be executed in parallel.
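The core pattern behind this kind of MPI parallelism is task decomposition: the set of independent averaging jobs (one per variable or per climatological average) is split among ranks, each rank computes its share, and the results are combined. The sketch below illustrates that idea in plain Python; it is a hypothetical illustration of the approach, not the actual PyReshaper or pyAverager code, and the variable names and round-robin scheme are assumptions. A real run would use mpi4py for communication and PyNIO for netCDF I/O.

```python
# Hypothetical sketch of rank-based task decomposition for parallel
# temporal averaging. Each "rank" claims a disjoint, round-robin subset
# of averaging tasks and computes its averages independently.

def partition_tasks(tasks, rank, size):
    """Round-robin assignment: rank r gets tasks r, r+size, r+2*size, ..."""
    return tasks[rank::size]

def temporal_average(series):
    """Plain time mean over a 1-D series of values."""
    return sum(series) / len(series)

# Toy data: monthly values for a few variables (stand-ins for netCDF fields).
data = {
    "TS":   [280.0, 281.0, 282.0, 283.0],
    "PSL":  [1010.0, 1012.0, 1008.0, 1010.0],
    "PREC": [2.0, 3.0, 1.0, 2.0],
}
tasks = sorted(data)   # one averaging task per variable
size = 2               # pretend MPI communicator size

# Each "rank" computes averages only for its own task subset; in a real
# MPI program the loop body runs concurrently on separate processes and
# the results are gathered (e.g. with comm.gather) at the end.
results = {}
for rank in range(size):
    for name in partition_tasks(tasks, rank, size):
        results[name] = temporal_average(data[name])

print(results["TS"])   # 281.5
```

Because each averaging task touches different variables, no communication is needed during the compute phase, which is why this workload scales well from one core to 120.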

This work on parallelizing the post-processing workflow was supported through NSF Core and NSF Special funds.