Develop efficient parallel data processing capabilities

This plot shows the speedup with PyReshaper (using 16 MPI processes) over the old NCO (serial) utility when transforming time-slice data into time-series format. The different datasets range in size from 8 GB (1-degree CICE) to 3 TB (0.1-degree POP), and the speedups range from 3x to 16x.
This plot shows the speedup with PyAverager (using 160 MPI processes) over the old NCO (serial) utility when computing climatological temporal averages from time-series data. The different datasets range in size from 8 GB (1-degree CICE) to 3 TB (0.1-degree POP), and the speedups range from 32x to 1400x.
This plot shows an example of a directed acyclic graph (DAG) describing the processes performed by PyConform to generate a single output file containing the user-defined variables x and y from input variables X1 and X2. Each output file is written by a separate MPI process, allowing speedups over previous standardization tools on the order of 16x to 40x with 16 MPI processes.
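The two operations described in the captions above can be illustrated with a small, self-contained sketch. The data structures and function names below are hypothetical, not PyReshaper's or PyAverager's actual APIs; the real tools read and write NetCDF files and distribute the work across MPI ranks (for example, one variable or one output file per rank) rather than operating on in-memory lists.

```python
# Hypothetical in-memory illustration of the time-slice -> time-series
# transformation (PyReshaper's job) and climatological averaging
# (PyAverager's job). Not the tools' real interfaces.

def slices_to_series(time_slices):
    """Convert per-timestep records ("time slices"), each mapping
    variable name -> value, into one record per variable ("time
    series") holding that variable's values over all timesteps."""
    series = {}
    for slice_ in time_slices:
        for var, value in slice_.items():
            series.setdefault(var, []).append(value)
    return series

def climatology(series, period=12):
    """Average each phase of a repeating period (e.g. each calendar
    month, period=12) across all cycles in the time series."""
    sums = [0.0] * period
    counts = [0] * period
    for i, value in enumerate(series):
        sums[i % period] += value
        counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]

# Three time slices, each holding every variable for one timestep:
slices = [
    {"TEMP": 271.3, "SALT": 34.7},
    {"TEMP": 271.9, "SALT": 34.6},
    {"TEMP": 272.4, "SALT": 34.8},
]
series = slices_to_series(slices)
# series["TEMP"] is the full TEMP time series: [271.3, 271.9, 272.4]

# A 2-step "period" over two cycles, just to keep the example short:
clim = climatology([1.0, 2.0, 3.0, 4.0], period=2)
# clim == [2.0, 3.0]
```

In the parallel tools, each rank performs a loop like the one in `slices_to_series` for only its assigned variables or files, which is why the speedups scale with the number of MPI processes.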

Efforts to meet the grand challenge of simulating the Earth System require ever-increasing model resolution and complexity. The supercomputing systems capable of tackling these challenges produce output data volumes that exceed the capacities of previous generations of data-processing utilities. Because the serial tools formerly used to process model data would restrict the pace of scientific discovery, researchers require new, efficient tools and techniques for handling today's and tomorrow's data flows. Data processing is now treated as an integral part of the Earth System simulation workflow, necessary for producing results in a reasonable amount of time. CISL's work to parallelize steps in data-processing workflows includes the development of new, lightweight Python utilities. Through parallelization, the bottlenecks in each phase of the data-processing workflow are being eliminated.

CISL's parallel Python data processing project has produced several new parallel utilities to handle current and future data volumes from the Community Earth System Model (CESM). These parallel utilities are critical to meeting NCAR's obligations as a member of Phase 6 of the Coupled Model Intercomparison Project (CMIP6). This work, developing more efficient approaches to data processing and compression, is specified as an action item in CISL's new strategic plan.

In FY2017, the I/O and Workflow Applications (IOWA) Group, in collaboration with NCAR's Climate and Global Dynamics (CGD) Laboratory, continued to develop and improve its parallel Python utilities, PyReshaper and PyAverager. In addition, members of the IOWA Group developed and released the PyConform utility, which uses a directed acyclic graph (DAG) to perform data standardization for Model Intercomparison Projects such as CMIP6. Each of these parallel Python tools relieves congestion at a significant bottleneck in the CESM post-processing workflow; each uses the Message Passing Interface (MPI) for parallelism, and each produces data in the NetCDF format. All of these tools are provided to the community via GitHub.
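The DAG approach used by PyConform can be sketched in a few lines: each output variable is a node whose inputs must be evaluated before it, so derived variables are computed by resolving the graph in dependency order. The graph below (x = X1 + X2, y = 2·X1, echoing the figure's example of deriving x and y from X1 and X2) and the `evaluate` helper are illustrative assumptions, not PyConform's actual API.

```python
# Minimal DAG evaluation in the style of PyConform: each entry maps an
# output variable to (operation, input names); names absent from the
# DAG are read directly from the input data. Hypothetical API.

def evaluate(dag, inputs):
    """Compute every output variable in `dag`, resolving each node's
    dependencies first and caching intermediate results."""
    cache = dict(inputs)

    def resolve(name):
        if name not in cache:
            func, deps = dag[name]
            cache[name] = func(*(resolve(d) for d in deps))
        return cache[name]

    return {name: resolve(name) for name in dag}

dag = {
    "x": (lambda a, b: a + b, ["X1", "X2"]),  # x = X1 + X2
    "y": (lambda a: a * 2.0, ["X1"]),         # y = 2 * X1
}
result = evaluate(dag, {"X1": 3.0, "X2": 4.0})
# result == {"x": 7.0, "y": 6.0}
```

Because each output file's DAG can be evaluated independently, assigning one file per MPI process parallelizes the standardization step naturally, which is the source of the 16x to 40x speedups noted above.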

Also in FY2017, the IOWA Group began testing and deploying the Cylc workflow automation tool for use with CESM runs; Cylc automation is being incorporated into the next release of CESM. Early tests of Cylc with CESM produced roughly 750 TB of CESM data in one month, enabling more efficient use of human and computational resources during campaign data production projects.

This work on parallelizing the post-processing workflow was supported through NSF Core and NSF Special funds.