Develop and deliver platforms for data services and analysis

The CISL high-performance computing and storage environment described in previous sections supports advanced numerical simulation of the Earth environment by the scientific community. The result of these efforts is huge volumes of numerical model outputs and associated observations. These data must be made discoverable by and available to users. Sophisticated analysis capabilities are essential, as is long-term archival preservation of key data for scientific reproducibility and future reuse. In addition, federal funding agencies impose requirements regarding data management planning and open data access.

To help researchers with these challenges, and to help funders and data repositories better support geoscience researchers, staff from CISL, the NCAR Library, and the Earth Observing Laboratory hosted the NSF-funded Geoscience Digital Data Resource and Repository Service (GeoDaRRS) workshop in FY2018. The workshop brought together more than 60 individuals, including geoscience researchers, technology experts, publishers, funders, and data repository personnel. Their discussions of scientific research and the relevance of data management education and policy resulted in a number of recommendations for addressing the community’s challenges.

Additional FY2018 highlights of CISL's work in the area of data repositories, data management support, and software for data analysis and visualization are discussed in the following sections.

Repositories and data services

CISL provides our research community with repositories and data services for hosting, locating, and accessing a variety of observational and model research data collections. These data are served through gateways over high-speed wide-area networks and are also accessible to users of CISL’s HPC, data analysis, and visualization resources. These repositories and data services support our communities’ efforts to extract scientific knowledge from the many petabytes of data available on NCAR’s cyberinfrastructure. Key components of our data repository and services portfolio include:

  • the Research Data Archive (RDA).

  • the Climate Data Gateway (CDG).

  • the Digital Asset Services Hub (DASH).

  • the Coupled Model Intercomparison Project (CMIP) Analysis Platform.

An overview of FY2018 activities associated with each of these components follows.

Research Data Archive

The Research Data Archive (RDA) hosts a large, diverse collection of meteorological and oceanographic observations, operational and reanalysis model outputs, remote sensing data sets, and ancillary data sets, such as topography/bathymetry, vegetation, and land use, to support atmospheric and geosciences research. RDA continually adds new data and augments existing data content to meet research community needs. While curation extends and adds to existing data sets, stewardship improves the documentation, creates systematic organization, applies data quality assurance, assigns digital object identifiers (DOIs), and develops user access. RDA maintenance and development within CISL are supported almost entirely by NSF Core funding. A small NASA grant, X15AG22G, supplemented the development of the International Comprehensive Ocean-Atmosphere Data Set.

Major FY2018 accomplishments included:

  • Growth of RDA content by 400 TB. The complete RDA is now over 1.8 PB. More than 1.2 PB of the newest and most-used collections are readily available to HPC users via NCAR’s Globally Accessible Data Environment (GLADE). Ten new data collections were added to the RDA, including selected products from the initial release of the ECMWF-produced ERA-5 reanalysis. Updates were made hourly or monthly to observational, analysis, and reanalysis data sets from NOAA. These were supplemented with similar updates for ECMWF and JMA products.

  • Delivery of more than 2.8 PB of data to 14,100 individuals through the primary access pathways: the NCAR HPSS, public web servers, and one-time special requests prepared for individuals. Web users form the largest group, with more than 8,000 downloading over 1.7 PB of data.

  • Enhancement of Globus GridFTP RDA data-transfer service options. Improvements included the addition of an option for users to request that data be transferred to their local systems as part of the subset processing workflow. More than 240 TB were transferred by 445 RDA users via Globus in FY2018.

  • Expansion of Thematic Realtime Environmental Distributed Data Services (THREDDS) Data Server access to 67 popular GRIB and NetCDF-formatted data sets, creating metadata and data access for scientific tools using standard interoperable protocols such as the Open-source Project for a Network Data Access Protocol (OPeNDAP).

  • RDA’s extension of HPC-driven spatial, temporal, and parameter subsetting with data format conversion options to 90 data sets. More than 57,000 individual data subset requests were processed, resulting in a net reduction of 98% in data volume transmitted relative to source archive file size.

  • Assignment of DOIs to 94 RDA data sets, which increased the potential for formal data citation.
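
The OPeNDAP access noted in the list above lets a client request only a slice of a remote variable by appending a constraint expression to the data set URL. The following is a minimal sketch of how such a request URL could be built, assuming the DAP2 hyperslab syntax var[start:stride:stop] with an inclusive stop index. The helper names, server address, data set path, and variable name are hypothetical placeholders for illustration and are not actual RDA endpoints.

```python
from urllib.parse import quote


def hyperslab(start, stop, stride=1):
    """Format one DAP2 index range; the stop index is inclusive."""
    if start < 0 or stop < start or stride < 1:
        raise ValueError("invalid index range")
    return f"[{start}:{stride}:{stop}]"


def opendap_subset_url(base_url, variable, ranges, response="ascii"):
    """Build a DAP2 request URL for a slice of one variable.

    base_url: OPeNDAP URL of the data set (hypothetical in this sketch).
    ranges:   one (start, stop) or (start, stop, stride) tuple per dimension.
    response: DAP2 response suffix such as "ascii", "dods", or "dds".
    """
    constraint = variable + "".join(hyperslab(*r) for r in ranges)
    # Percent-encode brackets and colons so the query string survives strict servers.
    return f"{base_url}.{response}?{quote(constraint, safe='')}"


if __name__ == "__main__":
    # Hypothetical data set and variable: first time step, every second row.
    print(opendap_subset_url(
        "https://example.org/thredds/dodsC/hypothetical/file.nc",
        "TMP",
        [(0, 0), (10, 20, 2)],
    ))
```

Requesting only the needed index ranges is the same idea that underlies the subsetting services described above: the less data a request names, the less data is transmitted.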

Climate Data Gateway

CISL develops and operates the NCAR Climate Data Gateway (CDG), which provides data discovery and access services for global and regional climate model data and software. The CDG supports access to data products from many of NCAR’s community modeling efforts, including IPCC, PCM, AMPS, CESM, NA-CORDEX, NARCCAP, and NMME activities. Each month the gateway is accessed by more than 1,600 users and delivers more than 60 TB of scientific data to the community.

FY2018 accomplishments included:

  • Growth of published data holdings by 1.1 PB, bringing the total data volume to 6.5 PB.

  • Access and downloads of data products by more than 19,000 users worldwide. More than 4,500 users were from U.S. universities and other U.S.-based public and private organizations.

  • Upgrading of data provider publishing software and infrastructure to CentOS 7 virtual machines.

  • Addition of API-token authentication for simpler bulk and scripted file access by end users.

  • Improvements to HPSS tape-based data transfer management, notification, and error handling.

  • Data services enhancements to support range requests and continuation of downloads.

  • Enabling of subsetting and NetCDF subset services on select open data sets including NA-CORDEX.

  • Improvement of the metrics processing pipeline and the overall performance of metrics capture.
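
Support for range requests, noted in the list above, is what allows an interrupted download to continue instead of starting over. The following is a minimal sketch of how a client could resume a transfer using the standard HTTP Range header. The function names are hypothetical and any URL passed in is a placeholder; this does not describe the CDG's actual client tooling, and authentication is omitted.

```python
import os
import urllib.request


def resume_headers(partial_path):
    """Return headers asking the server for only the bytes not yet on disk."""
    have = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # An open-ended range ("bytes=N-") requests everything from offset N onward.
    return {"Range": f"bytes={have}-"} if have > 0 else {}


def resume_download(url, dest, chunk_size=1 << 20):
    """Download url to dest, continuing from a partial file if one exists.

    If the file is already complete, a server typically answers 416 and
    urlopen raises urllib.error.HTTPError; callers may want to catch that.
    """
    request = urllib.request.Request(url, headers=resume_headers(dest))
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honored; 200 means the full file is being sent.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Checking the status code before choosing the file mode matters: a server that ignores the Range header returns the whole file, and appending it to the partial copy would corrupt the result.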

Digital Asset Services Hub

The Digital Asset Services Hub (DASH) is a cross-organizational integrative service supporting public access to data, software, models, and publications. DASH Search provides discovery of digital assets across multiple NCAR repositories. DASH Consulting provides coordination of data archiving throughout NCAR, data management plan assistance, and metadata guidance. DASH’s technical and informational services are guided by the Data Stewardship Engineering Team (DSET) under the supervision of and with support from the NCAR Executive Committee (EC). DASH Search now covers 4,500 data sets, 5,300 publications, and several software packages and models.

Highlights in FY2018 included:

  • The NCAR EC’s establishment of a policy foundation for coordinated data archiving in order to meet open access requirements (including existing lab-managed repositories and a new DASH repository).

  • Selection of the core technical infrastructure for the DASH Repository, based on the Advanced Cooperative Arctic Data and Information Service (ACADIS) system that was developed at NCAR. The surrounding protocols, documentation, and procedures for operations are designed to meet formal Trusted Digital Repositories criteria. Initial release for early adopters is expected in January 2019.

  • Consulting and development support by the Data Curation and Stewardship Coordinator in the areas of interface user experience, metadata preparation, assistance with DOI assignment, and proposal data management plans. Many supporting online educational materials were added to the DASH system and promoted across NCAR.

CMIP Analysis Platform

In FY2018, CISL completed its migration of CMIP5 data to the CMIP Analysis Platform, which was introduced in FY2016 as an allocated and supported service for users interested in climate model intercomparisons. The platform allows users to conduct large-scale analyses on a “lending library” of published CMIP5 data. In FY2018, more than 100 projects had active CMIP Analysis Platform allocations, and more than 40 TB of non-NCAR CMIP5 data were being hosted by CISL at the end of the fiscal year. CISL plans to add CMIP6 data to the CMIP Analysis Platform soon after modeling centers begin publishing the new data in FY2019.

Analysis and visualization software

CISL leads the development and support of open-source software for the analysis and visualization of Earth system science data. Most notably this includes the NCL suite of products and the VAPOR package, both of which boast user communities numbering in the tens of thousands worldwide. NCL provides a collection of more than 1,500 scripting functions for analyzing, reading, and visualizing Earth system data, while VAPOR is an interactive, GUI-based package that offers advanced 3D visualization capabilities.

VAPOR visualization of an MPAS grid showing a wire frame of the dual triangular mesh, along with contour lines of surface pressure. Support for unstructured grids was added to VAPOR in FY2018.

CISL finished the port of its PyNIO and PyNGL Python packages to Python 3.x and released versions of them in FY2018 in anticipation of the retirement of Python 2.x. CISL also released NCL version 6.5.0, with support for task parallelism, numerous new computational routines, and user contributions that include a built-in profiler, 29 new color tables, and improvements for time conversion routines. There were also several releases of WRF-Python, including 1.2.0, which added better support for xarray. In FY2018 there were 30,785 downloads of NCL, 18,019 downloads of PyNIO, 3,807 downloads of PyNGL, and 27,228 downloads of WRF-Python.

CISL began a review of the viability of NCL in a Python ecosystem by conducting an extensive survey and hosting the first-ever NCL Advisory Panel. The preliminary but strong consensus was that further development of NCL’s computational and visualization capabilities should be moved to Python, and a roadmap has been developed to guide NCL users through the “pivot to Python” transition while also adopting a more open development model. This will be the major focus for FY2019.

The VAPOR team released a first version of a new package, VAPOR3, in FY2018. VAPOR3 is the culmination of efforts begun in 2015 to refactor the VAPOR codebase to address a number of architectural deficiencies in the original design. These efforts were aimed at facilitating open development by the user community; generalizing the data model; improving support for the multitude of computational meshes employed by Earth system science numerical models; and simplifying the user interface. Work on VAPOR3 is ongoing.

Additional noteworthy FY2018 VAPOR accomplishments include:

  • Adding support for models employing unstructured grids (MPAS, for example) and the ability to simultaneously visualize multiple, correctly registered data sets (atmosphere and ocean, for example).

  • Winning an eighth year of funding ($100,000) for VAPOR development from the Korea Institute of Science and Technology Information.

  • Development of NCL data readers for VAPOR's wavelet-based, progressive-access data format (supported in part by an NSF SSI2 grant).

  • More than 3,500 downloads and ~30 journal citations.