Provide science gateways and data-sharing services

Figure: Capstone diagram. CISL services provide diverse scientific communities with access to shared data-management and data-use cyberinfrastructure, serving over 2,000 active users monthly. New projects such as Capstone are under development to bring advanced analysis capabilities to current and new user communities, building on the extensive base services CISL manages.

CISL builds and operates science gateways, data services, and data tools that provide diverse scientific communities with access to data-sharing infrastructure. Our projects span climate science (e.g., the Climate Simulation Data Gateway and the Coupled Model Intercomparison Project (CMIP) Analysis Platform), regional climate research, Arctic science, solar science, digital preservation, and international efforts to develop metadata and knowledge infrastructure. Many of these efforts are tied to major interagency, national, and international initiatives, including the Intergovernmental Panel on Climate Change (IPCC), the International Polar Year (IPY), the World Climate Research Program (WCRP), and the Library of Congress’ National Digital Information and Infrastructure Preservation Program (NDIIPP).

CISL serves data to the community through data gateways over high-speed wide-area networks and via high-speed disk and near-line tape systems. The CMIP Analysis Platform provides a one-stop shop for analysis and visualization of CMIP data on CISL’s DAV resources. The Data Sharing and Data Transfer services are built on Globus services and provide high-performance access to NCAR services and data. CISL is working to expand these data services with new web-based capabilities (e.g., server-side analysis, microservices, and specialized queries) to enhance the usability and impact of these services and data assets, as well as other data resources across NCAR.

Science gateway advances, open data access, and end-user support allow scientists and wide-ranging data consumers to spend less time handling and preparing data and more time on their research.

Scientists increasingly want to access data products openly and use cyber-resources via web- and grid-based environments such as science gateways, or via virtualized environments such as clouds. Such environments can increase scientific productivity by abstracting away a large body of arcane, machine-specific knowledge, allowing scientists to focus on their science. The challenge for CISL and other HPC organizations is to provide this new layer of virtualized services to research communities on a fixed budget. These services should scale with demand and operate seamlessly across multiple, heterogeneous, and potentially distributed computing systems. Such requirements present profound technical challenges. For example, while cloud technologies are being rapidly adopted in the enterprise computing sector, the parallel performance of large-scale applications running on cloud platforms remains poor.

Our contributions to science gateways support CISL’s computing imperative for software cyberinfrastructure by maintaining, operating, and supporting software specific to the simulation, analysis, and forecasting needs of the atmospheric and related sciences. They also address CISL’s computing frontier for center virtualization by operating science gateways and other technologies that provide critical cyberinfrastructure (CI) to broad communities. Finally, operational services provided for the Climate Simulation Data Gateway, ESGF, WMO, Data Sharing Services, new projects such as Capstone, and other collaborations address CISL’s strategic action item to meet the challenges posed by large and heterogeneous environmental data, and to establish metadata standards for diverse collections of data and models.

Climate Simulation Data Gateway

CISL operates the Climate Simulation Data Gateway, which provides data discovery and access services for global and regional climate model data, knowledge, and software. The Climate Simulation Data Gateway participates in the Earth System Grid Federation (ESGF), a globally distributed petascale data management environment for CMIP5/IPCC-AR5 and U.S. climate science. The gateway supports community access to data products from many of NCAR’s community modeling efforts, including IPCC, PCM, AMPS, CESM, NARCCAP, and NMME data products. More than 1,000 users access this resource each month, and it delivers over 30 terabytes of scientific data to the community monthly.

Accomplishments in FY2016 include expanded open access to NMME and other select datasets, expanded OPeNDAP services on openly accessible data, and a reworked publishing workflow that improves performance for large-volume, high-velocity data collections.
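OPeNDAP access allows clients to subset remote datasets on the server and retrieve only the variables and index ranges they request. The following is a minimal sketch of this usage pattern in Python; the URL, dataset, and variable names are hypothetical placeholders, not actual gateway addresses:

```python
# A minimal sketch of OPeNDAP-based access; the URL and variable name
# below are hypothetical placeholders, not actual gateway endpoints.
import xarray as xr

url = "https://gateway.example.ucar.edu/thredds/dodsC/sample/tas_mon.nc"  # hypothetical

ds = xr.open_dataset(url)                # opens lazily; no bulk download
subset = ds["tas"].sel(time="2015-01")   # only this slice crosses the network
print(float(subset.mean()))
```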

CMIP Analysis Platform

Most academic researchers do not have the resources to download, store, and analyze large portions (often tens or hundreds of terabytes) of the two petabytes of data published worldwide from Phase 5 of the Coupled Model Intercomparison Project (CMIP5). This challenge will be exacerbated in Phase 6, with data volumes expected to be 10 to 20 times larger than in CMIP5; NCAR alone projects that it will generate 5 PB or more of CMIP6 data. To address these barriers, NCAR has deployed the CMIP Analysis Platform, which supports analyses on a “lending library” of CMIP5 data. The platform integrates published CMIP5 data with a suite of human and software support services overlaid on NCAR’s operational analysis and disk storage environment.

We are prototyping the analysis platform within the existing Yellowstone environment and populating it with high-value CMIP5 datasets to demonstrate the feasibility of the concept and to better understand user needs, the support effort required, and the demands on the compute, storage, and software environment. Since it was announced to the user community in January 2016, a number of users have requested data through the CMIP Analysis Platform, and several terabytes of data have been added. The platform has also been integrated with CISL’s support environment so we can track requests.
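In practice, platform users open CMIP5 output directly from centralized disk rather than downloading it. The sketch below illustrates the idea with Python's xarray; the GLADE path and file pattern are hypothetical placeholders, not actual platform locations:

```python
# A minimal sketch of platform-style use: open CMIP5 output directly
# from shared disk and reduce it in place, so only the small derived
# result ever leaves the data center. Paths are hypothetical.
import glob
import xarray as xr

files = sorted(glob.glob(
    "/glade/p/cmip5/CCSM4/rcp85/tas_Amon_CCSM4_rcp85_r1i1p1_*.nc"))  # hypothetical
ds = xr.open_mfdataset(files)  # lazy, multi-file open

# Derive a compact global-mean time series from the full-resolution data.
global_mean = ds["tas"].mean(dim=("lat", "lon"))
print(global_mean.isel(time=slice(0, 12)).values)
```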

NCAR Data Sharing Service

The NCAR Data Sharing Service leverages Globus Plus, a Globus feature that allows researchers to share data with colleagues outside their home institutions, to increase customization options for storage and data sharing and to greatly facilitate collaborative work. The service gives researchers a simple web-based interface for sharing large datasets with collaborators around the world while leveraging the network bandwidth of NCAR's data transfer nodes.

Usage of the Data Sharing Service continued to grow in FY2016 as we expanded the space available to it; we currently serve 96 TB of data through the service.
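Under the hood, Globus sharing works by attaching access-control rules to a shared endpoint. The sketch below, using the Globus Python SDK (globus_sdk), shows the general shape of granting a collaborator read access; the endpoint UUID, identity, path, and token handling are hypothetical placeholders, not NCAR configuration:

```python
# A minimal sketch of granting read access on a Globus shared endpoint.
# All identifiers below are hypothetical placeholders; real use requires
# a proper OAuth2 flow to obtain the transfer token.
import globus_sdk

authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")  # placeholder
tc = globus_sdk.TransferClient(authorizer=authorizer)

rule = {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": "collaborator-identity-uuid",  # hypothetical identity
    "path": "/projects/myproject/shared/",      # hypothetical path
    "permissions": "r",                         # read-only access
}
tc.add_endpoint_acl_rule("shared-endpoint-uuid", rule)  # hypothetical UUID
```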

Data Transfer Services

The NCAR Data Transfer Services, built on Globus, provide parallel data transfer tools with direct access to all GLADE-hosted data. Globus, a project of the Computation Institute (a partnership of the University of Chicago and Argonne National Laboratory), is a software service that has been described as a “Dropbox for big data” and is broadly used in the scientific community. Using dedicated data transfer nodes, users can easily move data within NCAR and to peer institutions through Globus web-based services or scripted data transfer mechanisms.

The data transfer services were moved to new nodes and to a new 40-gigabit Ethernet network environment in FY2016. The new nodes support Globus transfer services along with traditional GridFTP, scp, sftp, and bbcp services, as well as access to the HPSS system for users without an HPC account.
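Scripted transfers can be driven through the Globus Python SDK as well as the web interface. A minimal sketch follows; the endpoint UUIDs, paths, and token handling are hypothetical placeholders:

```python
# A minimal sketch of a scripted Globus transfer. Endpoint UUIDs and
# paths are hypothetical; token acquisition is elided for brevity.
import globus_sdk

authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER_TOKEN")  # placeholder
tc = globus_sdk.TransferClient(authorizer=authorizer)

tdata = globus_sdk.TransferData(
    tc,
    "ncar-glade-endpoint-uuid",   # hypothetical source endpoint
    "campus-endpoint-uuid",       # hypothetical destination endpoint
    label="GLADE to campus",
    sync_level="checksum")        # skip files already transferred intact
tdata.add_item("/glade/scratch/user/run01/", "/data/run01/", recursive=True)

task = tc.submit_transfer(tdata)  # transfer proceeds asynchronously
print("Task ID:", task["task_id"])
```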

Capstone: Data Analysis as a Service Platform

Although climate and weather information is vital to research and decision-making in a wide variety of societally important contexts, it is difficult for scientists, resource managers, and concerned citizens to access and share the expert knowledge required for analyzing and drawing conclusions from weather and climate data. Common barriers to using climate data effectively are the time, resources, and expertise required to discover, access, and process the data for analysis. Massaging data in this way is time consuming, requires specialized, discipline-specific knowledge, may involve protracted email conversations with busy human experts, and is potentially error prone and difficult to reproduce. Further, the requisite deep knowledge of these analysis techniques is often encoded only in software, not in the published literature.

To begin addressing these issues, we are pursuing an integrated data analysis system that will provide server-side analysis of both model output and observational data. Capstone has the ambitious goal of addressing the data analysis needs of a diverse set of users (e.g., climate scientists, researchers and engineers from other disciplines, resource managers, and citizen scientists) who may lack the knowledge or resources to perform such analyses on their own. To that end, Capstone is being designed to offer remotely accessible, server-side computational services for data processing and analysis using a community-governed toolbox of (micro)services.
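To make the microservice concept concrete, the sketch below shows the general shape of a server-side analysis endpoint. It is purely illustrative; the route, parameters, and framework choice are assumptions for this example and are not drawn from the actual Capstone design:

```python
# Purely illustrative: the route, parameters, and framework choice are
# assumptions for this sketch, not the actual Capstone design.
from flask import Flask, jsonify, request
import xarray as xr

app = Flask(__name__)

@app.route("/v1/mean")
def area_mean():
    # Clients name a server-side dataset and variable; only the small
    # derived result crosses the network, never the bulk data.
    # (A real service would validate these inputs against a catalog.)
    ds = xr.open_dataset(request.args["dataset"])
    value = float(ds[request.args["variable"]].mean())
    return jsonify({"mean": value})

if __name__ == "__main__":
    app.run(port=8080)
```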

Capstone accomplishments in FY2016 include exploration and development of an initial cloud-based prototype by a SIParCS student team during summer 2016. This early work is based on a broad review of existing technologies and approaches, including those by peer research centers. An early implementation plan has been prepared, and we are pursuing funding and partnership opportunities.

Advanced Cooperative Arctic Data and Information Service (ACADIS)

ACADIS is a collaboration among CISL, NCAR’s Earth Observing Laboratory, the National Snow and Ice Data Center, and Unidata. ACADIS is a community data service that provides project data management planning, data archival, preservation, and access for all projects funded by NSF’s Arctic Science Program (ARC). CISL’s contributions to ACADIS include the ACADIS gateway, an end-to-end service through which NSF-supported data providers can publish their data collections and make them available to the broad community of researchers.

Accomplishments in FY2016 include refinement of an automated archive export and storage process using Amazon AWS services. The ACADIS data repository was successfully transitioned to management by NCEAS/NCEI in FY2016.
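As a rough illustration of one step in such an export pipeline, the sketch below uploads an archive bundle to object storage with the AWS boto3 SDK; the bucket name, key, and file are hypothetical placeholders, not details of the actual ACADIS process:

```python
# A minimal sketch, assuming an S3-based export like the one described
# above; the bucket, key, and file name are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")  # credentials come from the standard AWS config
with open("acadis_export_2016.tar", "rb") as f:
    # Stream the archive bundle to the export bucket.
    s3.upload_fileobj(f, "example-acadis-archive",
                      "exports/acadis_export_2016.tar")
```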

Community Data Portal (CDP)

The CDP offers a broad range of scientific data collections that includes observations, climate, atmospheric chemistry, space weather, field programs, models, analyses, and more. Roughly 2,200 registered CDP users are discovering, accessing, and using 8,000 collections representing over 6.5 terabytes of managed data holdings.

In FY2016 we researched, identified, and deployed prototype replacement technology for the CDP services. Technology direction decisions were based on input from active CDP data providers, prototype technology deployments, and the NCAR Data Stewardship Engineering Team (DSET). In FY2016 we continued to provide operational support, security upgrades, and critical bug fixes for the CDP services.

Chronopolis: Federated Digital Preservation over Space and Time

There is a critical and growing need to organize, preserve, and make accessible the increasing number of digital holdings that represent vital intellectual capital, much of which is precious and irreplaceable. Chronopolis is a strategic collaboration among the San Diego Supercomputer Center (SDSC, lead organization), NCAR/CISL, the University of California Library System, and the University of Maryland. It aims to develop national-scale digital preservation infrastructure with the potential to serve any community with digital assets in science, engineering, the humanities, and beyond. In addition to community collections, Chronopolis CI is being used to provide digital preservation services for the ACADIS project.

In FY2016, CISL continued developing a new web-based dashboard tool for system monitoring and for federation-wide reporting and capacity planning. We expanded our Chronopolis production node with a new 375 TB storage system and deployed a test node to support the project’s performance testing goals. CISL also continued to provide operational support for the NCAR storage node, which currently manages 25 terabytes of data in over 2.3 million objects.
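Fixity (checksum) auditing is a core operation in digital preservation systems. The sketch below illustrates the general idea of verifying stored objects against a manifest; it is illustrative only, does not represent the auditing tools the Chronopolis federation actually runs, and the manifest format and paths are assumptions:

```python
# Illustrative fixity check: recompute each object's SHA-256 digest and
# compare it to a stored manifest. The manifest format ({path: digest}
# in JSON) and file locations are assumptions for this sketch.
import hashlib
import json
from pathlib import Path

def sha256(path, chunk=1 << 20):
    """Compute a file's SHA-256 digest, reading in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

manifest = json.loads(Path("manifest.json").read_text())
for rel_path, expected in manifest.items():
    status = "OK" if sha256(rel_path) == expected else "CORRUPT"
    print(f"{status}  {rel_path}")
```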

Funding

The Climate Simulation Data Gateway and CDP are 100% supported by NSF Core funding. The CMIP Analysis Platform is funded by NSF Special funds. The Data Transfer Services and Data Sharing Service are 100% Core funded. Chronopolis is supported by special funds from the Chronopolis project. We are actively pursuing funding for the Capstone project.