Develop efficient parallel data processing capabilities

PyReshaper speedups
This plot shows the speedup with PyReshaper (using 16 MPI processes) over the old NCO (serial) utility when transforming time-slice data into time-series format. The different datasets range in size from 8 GB (1-degree CICE) to 3 TB (0.1-degree POP), and the speedups range from 3x to 16x.
PyAverager speedups
This plot shows the speedup with PyAverager (using 160 MPI processes) over the old NCO (serial) utility when computing climatological temporal averages from time-series data. The different datasets range in size from 8 GB (1-degree CICE) to 3 TB (0.1-degree POP), and the speedups range from 32x to 1400x.

Efforts to meet the grand challenge of simulating the Earth System require ever-increasing model resolutions and model complexity. The supercomputing systems capable of tackling these challenges produce output data volumes that exceed the capacities of previous generations of data-processing utilities. Because the serial tools formerly used for processing model data will restrict the increasing pace of scientific discovery, researchers require new and efficient tools and techniques for processing today’s and tomorrow’s data flows. Simulations of the Earth System now consider data processing as an integral part of the workflow necessary to produce results in a reasonable amount of time. CISL’s work to parallelize steps in data processing workflows includes the development of new, lightweight Python utilities. Through parallelization, the bottlenecks in each phase of the data processing workflow are being eliminated.

CISL’s parallel Python data processing project has produced several new parallel utilities to handle the Community Earth System Model’s (CESM) current and future data volumes. These parallel utilities are critical to meeting NCAR’s obligations as a member of Phase 6 of the Coupled Model Intercomparison Project (CMIP6). This work – to develop more efficient approaches for data processing and compression – is specified as an action item in CISL’s new strategic plan.

In FY2016, members of the Application Scalability and Performance (ASAP) Group collaborated with NCAR’s Climate and Global Dynamics (CGD) Laboratory to continue improving two parallel Python utilities that were released in FY2015: the PyReshaper and the PyAverager. Each tool relieves congestion at a significant bottleneck in the CESM post-processing workflow. The PyReshaper tool transforms CESM data from synoptic (history or time-slice) format to single-field (time-series) format in parallel. The PyAverager computes climatologically important temporal averages in parallel. Both utilities use the Message Passing Interface (MPI) for parallelism and utilize PyNIO (the Python NCL I/O library) for NetCDF file access.

In FY2016, CISL staff added new features and other improvements to the PyReshaper and the PyAverager, and both tools were made available publicly though GitHub. Also in FY2016, CISL developed a third parallel Python utility, code-named PyConform, to perform the “data standardization” step of the post-processing workflow for CMIP6. This is the last step in the workflow before the CESM data are published. PyConform is currently in the testing phase and its first release is scheduled for early 2017.

This work on parallelizing the post-processing workflow was supported through NSF Core and NSF Special funds.