Provide science gateways and data-sharing services

CISL builds and operates science gateways and data services that provide diverse scientific communities with access to data-sharing infrastructure. Our projects span climate science, regional climate research, arctic science, solar science, digital preservation, and international efforts to develop metadata and knowledge infrastructure. Many of these efforts are tied to major interagency, national, and international initiatives, including the Intergovernmental Panel on Climate Change (IPCC), the World Climate Research Programme (WCRP), and the Library of Congress’ National Digital Information Infrastructure and Preservation Program (NDIIPP).

CISL data services
CISL services provide shared cyberinfrastructure for data management and data use to diverse scientific communities totaling more than 2,000 active users each month. New projects, such as Capstone, are under development to bring advanced analysis capabilities to current and new user communities, building on the extensive base services managed by CISL.

CISL provides data tools and services for locating, accessing, transferring, and analyzing a variety of research data collections. CISL gateways span climate science, regional climate change, arctic science, solar science, digital preservation, and international efforts to develop metadata and knowledge infrastructure. CISL serves data to the community through data gateways over high-speed wide-area networks and via high-speed disk and near-line tape systems. The CMIP Analysis Platform provides a one-stop shop for analysis and visualization of CMIP data on CISL’s DAV resources. The Data Sharing and Data Transfer services are built on Globus services and provide high-performance access to NCAR services and data. CISL is working to expand its data services with new web-based capabilities (e.g., server-side analysis, microservices, specialized queries) that enhance the usability and impact of these services and data assets, as well as other data resources across NCAR.

Science gateway advances, open data access, and end-user support allow scientists and wide-ranging data consumers to spend less time handling and preparing data and more time on their research. Scientists increasingly want to openly access data products and use cyber-resources via web- and grid-based environments such as science gateways or virtualized environments such as clouds. Such environments can increase scientific productivity by abstracting away a large set of arcane, machine-specific knowledge, allowing scientists to focus on their science. Here, the challenge for CISL and other HPC organizations is to provide a new layer of virtualized services to support research communities – but on a fixed budget. These services should scale with demand and operate seamlessly across multiple, heterogeneous, and potentially distributed computing systems. Such requirements present profound technical challenges. For example, while cloud technologies are being rapidly adopted in the enterprise computing sector, the parallel performance of large-scale applications running on cloud platforms remains poor.

Our contributions to science gateways support CISL’s computing imperative for software cyberinfrastructure by maintaining, operating, and supporting software specific to the simulation, analysis, and forecasting needs of the atmospheric and related sciences. They also address CISL’s computing frontier for center virtualization by operating science gateways and other technologies that provide critical cyberinfrastructure (CI) to broad communities. Finally, operational services such as the NCAR Climate Data Gateway, the NCAR ESGF Data Node, and the Data Sharing Service, together with new projects such as Capstone and other collaborations, address CISL’s strategic action item to meet the challenges posed by large and heterogeneous environmental data and to establish metadata standards for diverse collections of data and models.

NCAR Climate Data Gateway

CISL operates the NCAR Climate Data Gateway, which provides data discovery and access services for global and regional climate model data, knowledge, and software. The NCAR Climate Data Gateway participates in the Earth System Grid Federation (ESGF), a globally distributed petascale data management environment for CMIP5/IPCC-AR5 and U.S. climate science. The gateway supports community access to data products from many of NCAR’s community modeling efforts, including IPCC, PCM, AMPS, CESM, NARCCAP, and NMME data products. It serves over 1,000 users each month and delivers over 60 terabytes of scientific data to the community monthly. Accomplishments in FY2017 include expanded open access to select datasets, expanded OPeNDAP and netCDF subsetting services on openly accessible data, improved download performance for tape-based collections, publishing workflow improvements, and performance enhancements.

CMIP Analysis Platform

Most academic researchers do not have the resources to download, store, and analyze large portions (often tens or hundreds of terabytes) of the 2 PB of data published worldwide from the Coupled Model Intercomparison Project, Phase 5 (CMIP5). This challenge will be exacerbated in Phase 6 (CMIP6), with data volumes expected to be 10 to 20 times larger than in CMIP5. For CMIP6, NCAR alone is projecting the creation of 5 PB of data or more. To address these barriers, NCAR has deployed the CMIP Analysis Platform, which supports analyses on a “lending library” of CMIP5 data. The CMIP Analysis Platform integrates published CMIP5 data and a suite of human and software support services overlaid on NCAR’s operational analysis and disk storage environment. In FY2017 CISL transitioned this service into a supported operational mode, populated the analysis platform with high-value CMIP5 datasets based on user requests, and worked to improve the data ingestion process in anticipation of CMIP6. The CMIP Analysis Platform currently provides over 20 TB of data and is in active use by a number of researchers.
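The scale of the problem can be made concrete with a short calculation. The sketch below is illustrative only: it uses the figures quoted above (2 PB of CMIP5 data, a 10 to 20 times growth factor, and the 20 TB lower bound for the platform) and assumes decimal units (1 PB = 1,000 TB). The function and variable names are hypothetical, not part of any CISL tool.

```python
# Illustrative arithmetic only. Decimal units (1 PB = 1000 TB) are an assumption.
CMIP5_PUBLISHED_PB = 2.0    # worldwide CMIP5 volume quoted in the text
GROWTH_FACTORS = (10, 20)   # expected CMIP6 growth range quoted in the text
PLATFORM_TB = 20.0          # lower bound of data held on the analysis platform


def projected_cmip6_range_pb(cmip5_pb, factors):
    """Return the (low, high) projected CMIP6 volume in PB."""
    return cmip5_pb * min(factors), cmip5_pb * max(factors)


def platform_share(platform_tb, archive_pb, tb_per_pb=1000.0):
    """Return the fraction of the archive held on the platform."""
    return platform_tb / (archive_pb * tb_per_pb)


low_pb, high_pb = projected_cmip6_range_pb(CMIP5_PUBLISHED_PB, GROWTH_FACTORS)
share = platform_share(PLATFORM_TB, CMIP5_PUBLISHED_PB)

print(f"Projected CMIP6 volume: {low_pb:.0f} to {high_pb:.0f} PB")
print(f"Platform holds at least {share:.1%} of the CMIP5 archive")
```

Under these assumptions the projected worldwide CMIP6 volume is 20 to 40 PB, and the platform's holdings are on the order of 1% of the published CMIP5 archive, which is consistent with populating it selectively from user requests.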

Data Sharing Service

The NCAR Data Sharing Service uses Globus Plus to increase customization options for both storage and data sharing. Globus Plus allows researchers to share data with colleagues outside their home institutions, greatly facilitating collaborative work. The NCAR Data Sharing Service gives researchers a way to share large datasets with collaborators around the world through a simple web-based interface while leveraging the network bandwidth of NCAR’s data transfer nodes. Usage of the Data Sharing Service continued to grow in FY2017 as the space available to the service expanded to 500 TB. The amount of space in use fluctuates and at times exceeds 100 TB; we are currently serving 11 TB of data through the service.

Data Transfer Services

The NCAR Data Transfer Services, built on Globus services, provide parallel data transfer tools with direct access to all GLADE-hosted data. Globus, a project of the Computation Institute (a partnership of the University of Chicago and Argonne National Laboratory), is a software service that has been described as “Dropbox for big data.” It is broadly used in the scientific community. Using dedicated data transfer nodes, users can easily transfer data within NCAR and to peer institutions through Globus online web-based services or scripted data transfer mechanisms. The data transfer services were moved to new nodes and to a new 40-GbE network environment in FY2017. The new nodes support Globus transfer services along with traditional GridFTP, scp, sftp, and bbcp services, as well as access to the HPSS system for users without an HPC account.
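To give a rough sense of what a 40-GbE environment means for large transfers, the sketch below estimates an idealized lower bound on transfer time. It is a back-of-the-envelope calculation, not a measurement of the NCAR service: the function name, the 10 TB example size, and the `efficiency` parameter are hypothetical, and it assumes decimal units (1 TB = 10^12 bytes). Real throughput also depends on disk speed, protocol overhead, and the remote endpoint.

```python
def ideal_transfer_hours(size_tb, link_gbps=40.0, efficiency=1.0):
    """Estimate transfer time in hours for size_tb terabytes.

    link_gbps is the nominal link speed in gigabits per second;
    efficiency is the assumed fraction of that speed actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the interval (0, 1]")
    bits = size_tb * 1e12 * 8                        # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)  # bits / (bits per second)
    return seconds / 3600


# Hypothetical 10 TB dataset at full line rate and at 50% efficiency.
for eff in (1.0, 0.5):
    hours = ideal_transfer_hours(10, efficiency=eff)
    print(f"10 TB at {eff:.0%} of 40 Gb/s: about {hours:.2f} hours")
```

Halving the assumed efficiency doubles the estimate, which is why the dedicated transfer nodes and parallel transfer tools described above matter as much as the nominal link speed.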

Capstone: Data Analysis as a Service Platform

Although climate and weather information is vital to research and decision-making in a wide variety of societally important contexts, it is difficult for scientists, resource managers, and concerned citizens to access and share the expert knowledge required to analyze and draw conclusions from weather and climate data. To begin addressing these issues, we are pursuing an integrated data analysis system that will provide server-side analysis of both model output and observational data. Capstone has the ambitious goal of addressing the data analysis needs of a diverse set of users (e.g., climate scientists, researchers and engineers from other disciplines, resource managers, and citizen scientists) who may lack some of the knowledge or resources needed to perform such analyses on their own.

An analysis platform or science commons for the NCAR community is a bold, high-risk, high-reward goal. Our focus in FY2017 has been on developing guiding use cases, engaging with and understanding the technical approaches of peer organizations, and identifying pilot projects to prototype and learn from. While this effort is currently in a startup stage, we’ve evaluated technologies and architectures via interns in CISL’s 2016 and 2017 SIParCS program, and we’re engaging with potential partners and seeking funding opportunities with the SDSC Workflows/WOrDS center, UKMO’s Informatics Lab, and TACC’s Agave project. We’ve identified an initial pilot project with the NCAR/MMM lab as a technical proof-of-concept, and we are pursuing additional funding and partnerships. Other activities related to Capstone include understanding cloud vendor capabilities, costs, and roadmaps for HPC and data analytics, and exploring containers based on data analysis and preparation software (e.g., NCL, CDO, NCO) for use in workflows built on web-based services.

DASH Search and Discovery

CISL supports the NCAR-wide Digital Asset Services Hub (DASH) effort with leadership, engineering, and data curation support. DASH is dedicated to providing support, engagement, and training for digital assets from NCAR and UCAR Community Programs (UCP), including datasets, publications, software, and models. A key early product of this effort is the DASH Search and Discovery Portal, which provides access to digital assets across UCAR and NCAR. In FY2017 we developed and released a beta version of the DASH Search and Discovery Portal, which includes a workflow system that lets data providers store and manage their metadata using the ISO 19115 metadata standard. The DASH Search is based on the open-source CKAN data portal software, which has been customized to harvest, search, and display digital assets based on the NCAR metadata dialect.

Chronopolis: Federated Digital Preservation over Space and Time

There is a critical and growing need to organize, preserve, and make accessible the increasing number of digital holdings that represent vital intellectual capital, much of which is precious and irreplaceable. Chronopolis is a strategic collaboration among the San Diego Supercomputer Center (SDSC, lead organization), NCAR/CISL, the University of California Library System, and the University of Maryland. It aims to develop national-scale digital preservation infrastructure with the potential to provide broad services to any community with digital assets in science, engineering, the humanities, and more. In FY2017, CISL continued developing the Chronopolis Dashboard tool for system monitoring, federation-wide reporting, and capacity planning. CISL also continued to provide operational support for the NCAR storage node, which currently manages 30 terabytes of data in over 3 million objects.

Funding sources

The NCAR Climate Data Gateway is 100% supported by NSF Core funding. The CMIP Analysis Platform is supported by NSF special funds. The Data Transfer Services and Data Sharing Service are 100% supported by NSF Core funding. Chronopolis is supported by special funds from the Chronopolis project. The Capstone project is supported by NSF Core funds, and we are actively pursuing additional funding.