Develop and deliver platforms for data services and analysis

Numerical simulation of the Earth's environment by the scientific community using CISL high-performance computing resources produces large volumes of model outputs and associated observations and derived products that must be stored and made accessible to users. Some data need long-term archival preservation and stewardship to support scientific reproducibility and future reuse and to meet open data requirements imposed by federal funding agencies and journal publishers. Appropriate services and tools for data discovery, retrieval, subsetting, analysis, and visualization are also needed. The following are highlights of CISL’s efforts to address these issues in FY2019.

Data repositories and services

CISL operates several data repositories and services for the benefit of NCAR researchers and the scientific community:

  • Research Data Archive

  • Climate Data Gateway

  • Digital Asset Services Hub 

Research Data Archive

The Research Data Archive (RDA) hosts a large, diverse collection of meteorological and oceanographic observations, operational and reanalysis model outputs, remote sensing data sets, and ancillary data sets, such as topography/bathymetry, vegetation, and land use, to support atmospheric and geosciences research. RDA continually adds new data and augments existing data content to meet research community needs.

FY2019 accomplishments included:

  • Growth of RDA content by 1.1 PB. The complete RDA is now over 2.9 PB. Twelve new data collections were added to the RDA, including the full suite of products from the 1979 to 2019 release of the ERA-5 reanalysis produced by the European Centre for Medium-Range Weather Forecasts (ECMWF). Updates were made hourly or monthly to observational, analysis, and reanalysis data collections from the National Oceanic and Atmospheric Administration. These were supplemented with similar updates for ECMWF and Japan Meteorological Agency products.

  • Provision of access to more than 5.1 PB of data for 15,000 individuals.

  • Enhancement of Globus GridFTP RDA data-transfer service options. Improvements included the addition of an option for users to select individual archive files to be transferred to their local systems from RDA-generated archive data file lists. Users transferred more than 1.1 PB via Globus in FY2019.

  • Expansion of Thematic Realtime Environmental Distributed Data Services (THREDDS) Data Server access to 77 popular GRIB- and NetCDF-formatted data sets.

  • Extension of HPC-driven spatial, temporal, and parameter subsetting with data format conversion options to 95 data collections. More than 70,000 individual data subset requests were processed, resulting in a net reduction of 98% in data volume transmitted relative to source archive file size.

  • Assignment of DOIs to 215 RDA data collections, which increased the potential for formal data citation.

  • The Data Engineering and Curation Section started harvesting and compiling available RDA data citation counts based on formal citations that use RDA DOIs. The effort identified 232 peer-reviewed articles or books that cited RDA data sets from January 2018 to October 2019.

  • The RDA became the first repository at NCAR to receive Trusted Digital Repository certification.
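
As a rough illustration of how a net reduction figure such as the 98% reported above for subset requests can be computed, the following Python sketch sums source and delivered volumes across requests and compares the totals. The `SubsetRequest` record, the `net_reduction_percent` function, and the sample sizes are hypothetical and are not taken from RDA software or statistics.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SubsetRequest:
    """Hypothetical record of one processed subset request (sizes in bytes)."""
    source_bytes: int      # total size of the source archive files involved
    delivered_bytes: int   # size of the subset actually sent to the user


def net_reduction_percent(requests: Iterable[SubsetRequest]) -> float:
    """Aggregate percent reduction in transmitted volume across all requests."""
    total_source = 0
    total_delivered = 0
    for req in requests:
        total_source += req.source_bytes
        total_delivered += req.delivered_bytes
    if total_source <= 0:
        raise ValueError("total source volume must be positive")
    return 100.0 * (1.0 - total_delivered / total_source)


# Illustrative sizes only; these are not RDA statistics.
sample_requests = [
    SubsetRequest(source_bytes=40 * 10**9, delivered_bytes=1 * 10**9),
    SubsetRequest(source_bytes=60 * 10**9, delivered_bytes=1 * 10**9),
]
sample_reduction = net_reduction_percent(sample_requests)
print(f"Net reduction: {sample_reduction:.1f}%")
```

Summing the volumes before dividing weights each request by its size, which is one reasonable reading of a "net" figure. Averaging the per-request percentages instead would generally give a different number.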

RDA maintenance and development are supported almost entirely by NSF Core funding. A small NASA grant, X15AG22G, supplemented development of the International Comprehensive Ocean-Atmosphere Data Set.

Climate Data Gateway

CISL develops and operates the NCAR Climate Data Gateway (CDG), which provides data discovery and access services for global and regional climate model simulation data. More than 3,000 users access the gateway each month, using data products from many of NCAR’s current and historic community modeling efforts. FY2019 accomplishments included:

  • Access and download of data products by more than 31,000 users. Over 9,000 users were from U.S. universities and other U.S.-based public and private organizations.

  • Expansion of Community Earth System Model (CESM) data holdings by 1.2 PB, bringing the total data volume to 8.1 PB.

  • Establishment of 74 DOIs for community data sets.

  • Expansion of THREDDS OpenDAP and NetCDF Subset Services to disk-based collections.

  • Upgrade of the authorization model to simplify future code and change management.

  • Enhancement of change tracking and of metadata and data auditing capabilities.

  • Improvement of overall publishing performance for large data collections.

Digital Asset Services Hub

The Digital Asset Services Hub (DASH) is a cross-organizational integrative service supporting public access to data, software, models, and publications. DASH provides a centralized system for searching digital scientific assets across NCAR and UCAR Community Programs; coordination of data archiving throughout NCAR; data management plan assistance; and metadata production guidance. DASH Search now covers 4,900 data sets, 6,000 publications, and several software packages and models.

Since first coming online in January 2019, the new DASH Repository has grown to an archive of 12 digital assets. It provides persistent data archiving and distribution of data collections from UCAR/NCAR researchers and projects. It complements other NCAR-managed data repositories and focuses on providing long-term preservation and stewardship of NCAR's small-scale data collections.

Other achievements in FY2019 included:

  • Implementation of protocols, documentation, and procedures for operating the DASH Repository to meet formal Trusted Digital Repository criteria.

  • Improved consulting and development support in the areas of user experience, metadata preparation, DOI assignment assistance, proposal data management plans, and support for the DASH Repository data workflow. 

Analysis platforms

The Coupled Model Intercomparison Project (CMIP) Analysis Platform (CMIP AP) allows users to conduct large-scale analyses on a pool of published CMIP data. CMIP AP is a dedicated storage area on CISL's high-performance GLADE file system; authorized users can compute directly on the data using the Casper data analysis and visualization cluster.

In FY2019:

  • CISL expanded platform operations to support CMIP6 data requests.

  • More than 100 projects had active CMIP AP allocations.

  • More than 50 TB of non-NCAR CMIP5 data and 300 TB of CMIP6 data were transferred to the platform. This represents data products from 50 modeling centers participating in CMIP activities.

CISL also began an effort in FY2019 to host key data sets in the commercial cloud to enable large-scale computation by researchers or industry users who do not have access to NCAR's on-premises high-performance computing, analysis, and visualization resources. The cloud-hosted data are converted from the original NetCDF format to an analysis-optimized format.

Initial steps in this effort included:

  • Publication of approximately 70 TB of CESM Large Ensemble (LENS) data on Amazon Web Services (AWS) in collaboration with the Amazon Sustainability Data Initiative. Storage and egress charges were waived as part of the AWS Public Dataset Program.

  • Development of a Jupyter Notebook reproducing previously published analyses of CESM LENS data.

Analysis and visualization software

CISL leads the open development and support of software for the analysis and visualization of geoscience data, including the new Geoscience Community Analysis Toolkit (GeoCAT) libraries, the desktop VAPOR package for 3D visualization, the Meteo augmented reality (Meteo AR) application, and the legacy NCAR Command Language (NCL).

Geoscience Community Analysis Toolkit

GeoCAT is an effort that began in FY2019 to provide a new collection of functions for analyzing and visualizing geoscience data. GeoCAT incorporates NCL-related tools into a Python interface that is designed to scale across distributed-memory and multi-core architectures. Work on GeoCAT follows an open development model that seeks to engage the user community in numerous areas, including implementation, support, and training.

Much of FY2019 was spent staffing up the GeoCAT development team, planning, prototyping, and evaluating. CISL published a GeoCAT development roadmap, established a web site (https://geocat.ucar.edu/), and produced a number of rolling releases to get early feedback from stakeholders.

VAPOR

VAPOR is an interactive, GUI-based package offering advanced 3D visualization capabilities that target exploratory science. CISL released version 3.1 of VAPOR3 in FY2019 with significant new features that include volume rendering, isosurface rendering, support for 3D Model for Prediction Across Scales data sets, and 3D data slicing. Users downloaded VAPOR2 approximately 900 times and cited it 32 times in FY2019. VAPOR3 was downloaded 461 times following its release at the end of July 2019. An article on VAPOR3 was published in Atmosphere (doi:10.3390/atmos10090488). Figure 1 below is from that article.

Figure 1. VAPOR visualization of a numerically simulated tornado, demonstrating VAPOR3’s new volume-rendering capability. This image, produced by the University of Wisconsin’s Leigh Orf, appeared in the August 2019 issue of the journal Atmosphere.

Meteo AR

CISL continued providing support and improvements to its Meteo AR application in FY2019, adding around 2,700 new users and increasing the total number of downloads to more than 13,550. Meteo AR is an example of augmented and virtual reality technology that makes NCAR science more engaging and accessible to the public. The app displays interactive virtual objects, such as an animated globe or a hurricane, over real-world imagery captured by a mobile device’s camera. The app works with the NCAR “science cube” and downloadable science sheets that provide background information about topics such as El Niño, hurricanes, and climate change. The app is free and available for download at the Apple App Store and for Android devices at Google Play.

NCAR Command Language

Following the release of NCL version 6.5.2 in the spring, CISL announced to the user community that it would put the popular scripting language into maintenance mode and turn its focus to the newly launched GeoCAT suite of tools. NCL provides a collection of more than 1,500 scripting functions for analyzing, reading, and plotting geoscience data, and has tens of thousands of users worldwide.