Expand the content of and access to the RDA

The Research Data Archive (RDA) is a key part of CISL’s computing imperative for data curation and provision. It provides a rich information resource through a large and growing collection of data sets that support scientific studies in climate, weather, Earth System modeling, and increasingly, other related sciences. The RDA is developed to serve research needs at NCAR and in the associated UCAR community, but because it is an open resource, the global community also accesses it frequently. To meet research community needs, the RDA continuously adds new data content and augments existing content. Access is also improved through new tools, web services, and HPC-driven workflows that extract user-specified data from multi-terabyte data sets. The RDA is nationally and internationally respected for its staff, data management practices, consulting services, and ability to positively affect outcomes in the data domain. This position is advantageous for building collaborations that continually strive to provide better scientific data resources and access.

RDA metrics
These charts show the data access and growth metrics for the RDA during FY2006-2017. a) The number of unique RDA users specified by access pathway: the NCAR HPSS, publicly available web servers, and one-time special requests (orders) prepared for individual users. b) The amount of data delivered to customers, by access pathway. c) The amount of data in the HPSS archive, showing annual growth and not including backups. d) The amount of data on public web servers, showing annual growth. Charts a) and b) indicate the RDA’s significance to the community. Charts c) and d) show the annual progress toward building more valued content into the RDA. Note that services related to the TIGGE project were terminated in 2016, and are no longer represented on these charts.

All efforts to improve the RDA focus on enhancing the productivity of the weather and climate research communities. The RDA strives to minimize the data work burden for researchers by hosting the needed reference data sets and offering personal consulting services. It provides several easy access pathways: locally to the CISL HPC from a directly connected central file system; via the web through UIs and standard APIs that support machine-to-machine interoperability; and through customized data packages created on demand for individuals from large and heterogeneous collections. The RDA is also valued for its professionally curated and stewarded data collections, which are preserved for the long term, are citable, support reproducible science, and help NCAR meet federal requirements for open access to data.

In FY2017, over 13,000 unique persons were provided over 2 petabytes of data through the primary access pathways: the NCAR HPSS, public servers on the web, and one-time special requests prepared for individuals. The total number of unique users increased steadily from 2012 through 2017 (see chart a). One-time requests (orders) include subsetting, format conversion, and restaging files from the HPSS to disk. Data delivered by full-file downloads on the web increased by over 100 terabytes, while data delivered through orders decreased by 150 terabytes in 2017 (chart b). CISL continues to make it easier for users to access terabyte-sized archives on their own. Orders were automatically prepared for over 4,900 individuals, an increase of 300 users over 2016, and they received about 700 terabytes of data. Web users form the largest group, with over 8,000 people downloading over 1.3 petabytes of data. The newest and most-used RDA collections are directly available to the HPC environment from NCAR’s Globally Accessible Data Environment (GLADE). We currently cannot measure usage of this pathway, but it appears to be substantial: access from the tape-based HPSS has dropped to less than 7 terabytes, and anecdotally, our local users are pleased. These metrics indicate that the RDA is an important data resource for a broad community.
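As a rough consistency check on the figures above, the following minimal sketch sums the approximate FY2017 delivery volumes by pathway and reports each pathway's share. The inputs are the rounded values quoted in the text ("over 1.3 petabytes", "about 700 terabytes", "less than 7 terabytes"), so the results are approximations only. The decimal conversion of 1 PB = 1000 TB and the names `delivered_tb` and `pathway_shares` are assumptions made for illustration.

```python
# Approximate FY2017 delivery volumes by pathway, in terabytes, using the
# rounded figures quoted in the text. Decimal units (1 PB = 1000 TB) are an
# assumption, not something the report states.
TB_PER_PB = 1000

delivered_tb = {
    "web": 1.3 * TB_PER_PB,  # "over 1.3 petabytes"
    "orders": 700,           # "about 700 terabytes"
    "hpss": 7,               # "less than 7 terabytes" (upper bound)
}


def pathway_shares(volumes):
    """Return each pathway's fraction of the total delivered volume."""
    total = sum(volumes.values())
    return {name: tb / total for name, tb in volumes.items()}


total_pb = sum(delivered_tb.values()) / TB_PER_PB
shares = pathway_shares(delivered_tb)

print(f"total delivered: about {total_pb:.2f} PB")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under these rounded inputs the total comes to roughly 2 petabytes, in line with the headline figure, with web downloads as the largest pathway.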

The RDA content expanded by 400 terabytes in FY2017 (chart c). The complete RDA is now over 1.4 petabytes, and over 1 petabyte of it is readily available via GLADE (chart d). NCAR users can access the portion of the RDA not available on GLADE directly from the HPSS, and the Data Support Section provides automated procedures to assist outside users with data access from HPSS.

The RDA is constantly changing. Curation extends and adds to existing data sets, and stewardship improves the documentation, creates systematic organization, applies data quality assurance, assigns DOIs, and develops user access. Many routine tasks and background infrastructure developments are necessary to maintain the RDA. Major accomplishments for FY2017 include:

  • On an hourly to monthly basis, updates are made to observational, analysis, and reanalysis data sets from NOAA. These are supplemented with similar updates for ECMWF and JMA products.

  • Expanded automated systems use the CISL HPC and GLADE to give users better access to terabyte-sized data sets. More than 49,000 individual data requests were processed, reducing the delivered data volume by a net 97% relative to the size of the source archive files.

  • Globus GridFTP data transfer options for the RDA were enhanced, including a new option that lets users select individual archive files from a data set collection for transfer over Globus. Over 450,000 files totaling 22 terabytes were transferred by RDA users via Globus in 2017.

  • These significant data assets were added to the RDA:

    • Arctic System Reanalysis version 2
    • High Resolution WRF Simulations of the Current and Future Climate of North America
    • NCAR/MOPITT Reanalysis
    • GridRad – Three-Dimensional Gridded NEXRAD WSR-88D Radar Data
    • Dai and Trenberth Global River Flow and Continental Discharge Dataset
    • Dai Global Palmer Drought Severity Index (PDSI)
    • Add-on Data for the ICOADS Value-Added Database (IVAD)
    • GEOS5 Global Atmospheric Forcing Data
    • Daily Gridded North American Snowfall

  • Expanded Thematic Realtime Environmental Distributed Data Services (THREDDS) Data Server (TDS) access to 56 popular GRIB and NetCDF formatted data sets, creating metadata and data access for scientific tools using standard interoperable protocols such as Open-source Project for a Network Data Access Protocol (OPeNDAP)

  • Expanded HPC-driven spatial, temporal, and parameter subsetting with data format conversion options to 68 data sets

  • Increased formal data citation potential by assigning and maintaining DOIs on 87 RDA data sets
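The 97% net volume reduction reported for subset requests in the list above can be read as one minus the ratio of delivered bytes to source bytes, aggregated over all requests. The sketch below illustrates that reading with a small, entirely hypothetical request log. The report does not describe how the figure was actually computed, so the function name `volume_reduction`, the sample sizes, and the aggregation over totals are assumptions.

```python
def volume_reduction(source_bytes, delivered_bytes):
    """Fraction by which the delivered data is smaller than the source files."""
    if source_bytes <= 0:
        raise ValueError("source_bytes must be positive")
    return 1.0 - delivered_bytes / source_bytes


# Hypothetical request log: (source archive bytes read, bytes delivered).
# These values are illustrative only and are not RDA measurements.
requests = [
    (500e9, 12e9),
    (2e12, 70e9),
    (80e9, 5e9),
]

total_source = sum(source for source, _ in requests)
total_delivered = sum(delivered for _, delivered in requests)

# The net figure is computed from the totals, so large requests weigh more
# than small ones; it is not the mean of the per-request reductions.
net = volume_reduction(total_source, total_delivered)
print(f"net reduction: {net:.1%}")
```

Aggregating over totals rather than averaging per-request percentages keeps the metric tied to the actual bytes moved, which is the quantity that matters for network and storage load.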

RDA maintenance and development within CISL are almost entirely supported by NSF Core funding. A small NASA grant, X15AG22G, supplemented development of ICOADS.