Expand the content of and access to the RDA

RDA metrics
These charts show the data access and growth metrics for the RDA during FY2006-2016: a) The number of unique RDA users specified by access pathway: the NCAR HPSS, publicly available web servers, and one-time special requests (orders) prepared for individual users. b) The amount of data delivered to customers, by access pathway. c) The amount of data in the HPSS archive, showing annual growth and not including backups. d) The amount of data on public web servers, showing annual growth. Charts a) and b) indicate the RDA’s significance to the community. Charts c) and d) show the annual progress toward building more valued content into the RDA. Note that services related to TIGGE were terminated in 2016.

The Research Data Archive (RDA) is a key part of CISL’s computing imperative for data curation and provision. It provides a rich information resource through a large and growing collection of data sets that support scientific studies in climate, weather, Earth System modeling, and increasingly, other related sciences. The RDA is developed to serve research needs at NCAR and in the associated UCAR community, but because it is an open resource, the global community also accesses it frequently. To meet research community needs, the RDA continuously adds new data and augments existing content. Access is also improved with new tools, web services, and HPC-driven workflows that extract user-specified data from multi-terabyte data sets. The RDA is nationally and internationally respected for its staff, data management practices, consulting services, and ability to positively affect outcomes in the data domain. This standing helps build collaborations that continually strive to provide better scientific data resources and access.

All efforts to improve the RDA focus on enhancing the productivity of the weather and climate research communities. The RDA strives to minimize the data work burden for the researcher by hosting the needed reference data sets and offering personal consulting services. It provides three easy access pathways: locally to the CISL HPC from a directly connected central file system, via the Web through UIs and standard APIs that support machine-to-machine interoperability, and through customized data packages created on demand for individuals from large and heterogeneous collections. The RDA is also valued for its professionally curated and stewarded data collections, which are preserved for the long term, are citable, support reproducible science, and help NCAR meet Federal requirements for open access to data.

In FY2016, over 12,500 unique users received about 2.2 petabytes of data through the primary access pathways: the NCAR HPSS, public servers on the web, and one-time special requests prepared for individuals. The total number of unique users increased steadily from 2012 through 2016 (see chart a). One-time requests (orders) include subsetting, format conversion, and restaging files from the HPSS to disk. In 2016, data delivered by full-file downloads on the web increased by 100 terabytes and data delivered through orders increased by 300 terabytes (chart b). CISL is making it easier for users to access terabyte-sized archives on their own. Orders were automatically prepared for over 4,600 individuals, who received about 850 terabytes of data. Web users form the largest group, with 7,800 people downloading over 1.2 petabytes of data. The newest and most-used RDA collections are directly available from NCAR’s Globally Accessible Data Environment (GLADE) to the HPC environment. We currently cannot measure usage through this pathway, but it appears to be substantial: access from the tape-based HPSS has dropped to less than 10 terabytes, and anecdotally, our local users are pleased. These metrics indicate that the RDA is an important and growing data resource for a broad community.
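As a rough illustration of the per-pathway figures quoted above, the following Python sketch computes the average volume delivered per user for the web and order pathways. It uses only the rounded numbers from this paragraph (7,800 web users and 1.2 petabytes; 4,600 order recipients and 850 terabytes) and assumes decimal units (1 TB = 1000 GB). The names PATHWAYS, gigabytes_per_user, and summarize are illustrative and are not part of any RDA tool.

```python
# Rounded FY2016 figures taken from the paragraph above ("over" / "about" in the text).
# Decimal units are assumed: 1 PB = 1000 TB, 1 TB = 1000 GB.
PATHWAYS = {
    "web": {"users": 7_800, "terabytes": 1_200.0},
    "orders": {"users": 4_600, "terabytes": 850.0},
}


def gigabytes_per_user(users: int, terabytes: float) -> float:
    """Average delivered volume per user, in decimal gigabytes."""
    if users <= 0:
        raise ValueError("users must be positive")
    return terabytes * 1000.0 / users


def summarize(pathways: dict) -> dict:
    """Map each pathway name to its average gigabytes delivered per user."""
    return {
        name: gigabytes_per_user(p["users"], p["terabytes"])
        for name, p in pathways.items()
    }


if __name__ == "__main__":
    for name, gb in summarize(PATHWAYS).items():
        print(f"{name}: roughly {gb:.0f} GB per user")
```

Because the inputs are rounded, the results are order-of-magnitude indicators only; they suggest that a typical user on either pathway receives on the order of a hundred or more gigabytes per year.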

The RDA content expanded by 160 terabytes in FY2016 (chart c). The complete RDA is now over 1.0 petabytes, and over 700 terabytes of it is readily available via GLADE (chart d). The THORPEX (THe Observing system Research and Predictability EXperiment) Interactive Grand Global Ensemble (TIGGE) archive was removed from the RDA, which caused the dramatic drop in overall size observed in 2016 (chart c). NCAR users can access the portion of the RDA not available on GLADE directly from the HPSS, and the Data Support Section provides automated procedures to assist outside users with data access from the HPSS.

The RDA is constantly changing. Curation extends and adds to existing data sets, and stewardship improves the documentation, creates systematic organization, applies data quality assurance, assigns DOIs, and develops user access. Many routine tasks and background infrastructure developments are necessary to maintain the RDA. Major accomplishments for FY2016 include:

  • Updated observational, analysis, and reanalysis data sets from NOAA on an hourly to monthly basis, and supplemented them with similar updates for ECMWF and JMA products.

  • Expanded automated systems that use CISL HPC and GLADE to give users better access to terabyte-sized data sets. More than 75,000 individual data requests were processed, resulting in a net data volume reduction of 98% relative to the size of the source archive files.

  • Fully enabled the RDA’s data-transfer services for Globus GridFTP. This integration included a convenient identity-provider layer so that users can log in with their RDA credentials to manage Globus-based transfers of RDA data. In addition, the RDA’s user “dashboard” feature allows users to view all Globus shares granted to them and provides a direct link to the Globus interface to initiate a data transfer. RDA users transferred over 5.6 million files totaling 125 TB via Globus.

  • Added significant data assets to the RDA:

    • The International Comprehensive Ocean-Atmosphere Data Set (ICOADS) Release 3
    • NCAR MMM 10-member, 3-km, real-time ensemble prediction system
    • Cross-Calibrated Multi-Platform Ocean Surface Wind Vector Analysis Product V2
    • An Ensemble of Atmospheric Forcing Files from a CAM reanalysis
    • EarthScope USArray Transportable Array (TA) Surface Pressure Observations
    • Coordinated Ocean-ice Reference Experiments - Phase II
    • ECMWF IFS CY41r2 High-Resolution Operational Forecasts
    • Arctic System Reanalysis 30km Monthly Means

  • Expanded Thematic Realtime Environmental Distributed Data Services (THREDDS) Data Server (TDS) access to 49 popular GRIB and NetCDF formatted data sets, creating metadata and data access for scientific tools using standard interoperable protocols such as Open-source Project for a Network Data Access Protocol (OPeNDAP)

  • Expanded HPC-driven spatial, temporal, and parameter subsetting with data format conversion options to 63 data sets

  • Increased formal data citation potential by assigning and maintaining DOIs on 79 RDA data sets
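
The 98% net data volume reduction reported for subset requests in the list above can be read as one minus the ratio of delivered bytes to source bytes. The Python sketch below shows that arithmetic under two assumptions not stated in the report: that "net" means the ratio is taken over the summed volumes of all requests rather than averaged per request, and that the example sizes are purely hypothetical. The function names are illustrative only.

```python
def volume_reduction(source_bytes: float, delivered_bytes: float) -> float:
    """Fraction of the source volume that did not need to be delivered."""
    if source_bytes <= 0:
        raise ValueError("source_bytes must be positive")
    if delivered_bytes < 0:
        raise ValueError("delivered_bytes must not be negative")
    return 1.0 - delivered_bytes / source_bytes


def net_reduction(requests) -> float:
    """Aggregate reduction over (source_bytes, delivered_bytes) pairs.

    Assumption: totals are summed first, so large requests weigh more
    than small ones. The report does not say how its figure was computed.
    """
    requests = list(requests)
    total_source = sum(src for src, _ in requests)
    total_delivered = sum(out for _, out in requests)
    return volume_reduction(total_source, total_delivered)


if __name__ == "__main__":
    GB = 10**9
    # Hypothetical requests: (size of source archive files, size of delivered subset).
    example = [(500 * GB, 10 * GB), (1_500 * GB, 30 * GB)]
    print(f"net reduction: {net_reduction(example):.0%}")
```

In this hypothetical case, 2,000 GB of source files yield 40 GB of delivered subsets, which corresponds to the same 98% reduction the report cites.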

RDA maintenance and development within CISL are almost entirely supported by NSF Core funding. A small NASA grant supplemented development of ICOADS.